Hi,

I've been looking at Btrfs and have a couple of naive questions that don't seem to be answered on the wiki or in the articles I've read on the filesystem.

First: discovering a file's checksum value.

Here's the scenario: software is writing some data as a fresh file. This software happens to know (a priori) the checksum of this data; for example, a storage server receives the file's data and checksum independently.

I have some confidence that, once the data is stored in btrfs, any corruption (from the storage fabric) will be spotted; however, the data may have become corrupt before being stored (e.g., from the network). To catch this, the checksum of the stored data needs to be calculated and checked.

One approach is to calculate the checksum (in user-space) after the data is stored. This adds extra IO and CPU load, and there's also the possibility of false-negative results due to the filesystem cache, since the re-read may be served from the cache rather than from disk (although btrfs may remove this risk).

Another approach would be to ask btrfs for the checksum. It seems that it's possible to combine multiple CRC-32C values to figure out the checksum of the combined data [e.g., zlib's crc32_combine() function]. So, obtaining a file's checksum might be a light-weight operation.

Yet another possibility would be to push the desired checksum value (via fcntl?) and have btrfs compare the desired checksum with the file's actual checksum on close(2), failing that call if the checksums don't match.

Would any of this be possible (without an awful lot of work)?

Second: adding support for Adler32?

Looking at the unstable git repo, it looks like there's currently support for only the CRC-32C checksum algorithm. Is this correct? If so, is anyone working on adding support for Adler32?

Cheers,

Paul.

(ps, please keep me CC-ed in on replies)

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
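The combining idea mentioned above (zlib's crc32_combine()) can be sketched in a few lines. This is an illustrative sketch only, not btrfs code: the function names crc_combine and whole_file_crc are made up for this example, and it assumes each per-block value is a standard CRC with the usual initial value and final inversion, which may not match how a filesystem stores its checksums internally. Python's standard library has no CRC-32C, so the check below uses zlib.crc32; passing the Castagnoli polynomial selects CRC-32C arithmetic instead.

```python
import zlib

CRC32_POLY = 0xEDB88320   # reflected polynomial used by zlib.crc32
CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial (CRC-32C)


def _times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (stored as a list of columns) by a vector."""
    out = 0
    i = 0
    while vec:
        if vec & 1:
            out ^= mat[i]
        vec >>= 1
        i += 1
    return out


def _square(mat):
    """Return mat * mat over GF(2)."""
    return [_times(mat, col) for col in mat]


def crc_combine(crc_a, crc_b, len_b, poly=CRC32_POLY):
    """Return crc(A + B) given crc(A), crc(B) and len(B) in bytes."""
    if len_b <= 0:
        return crc_a
    # Operator that advances the CRC register over one zero *bit*.
    op = [poly] + [1 << n for n in range(31)]
    # Square three times: 1 bit -> 2 -> 4 -> 8 bits, i.e. one zero byte.
    for _ in range(3):
        op = _square(op)
    # Apply the one-byte operator len_b times using repeated squaring.
    while len_b:
        if len_b & 1:
            crc_a = _times(op, crc_a)
        op = _square(op)
        len_b >>= 1
    return crc_a ^ crc_b


def whole_file_crc(block_crcs, poly=CRC32_POLY):
    """Fold (crc, length) pairs for consecutive blocks into a single CRC."""
    total = 0  # CRC of the empty byte string
    for crc, length in block_crcs:
        total = crc_combine(total, crc, length, poly)
    return total
```

The cost is logarithmic in the block length rather than linear in the data size, which is why obtaining a whole-file value from stored per-block values could be cheap compared with re-reading the file.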
On Thursday 27 May 2010 15:39:54 Paul Millar wrote:
> [...]
> Yet another possibility would be to push the desired checksum value (via
> fcntl?) and have btrfs compare the desired checksum with the file's actual
> checksum on close(2), failing that call if the checksums don't match.
>
> Would any of this be possible (without an awful lot of work)?

IMO, if an application receives data with a checksum, it can calculate the checksum of the data on the fly, as it writes it to disk. That adds no extra IO to the storage subsystem. It won't detect in-memory corruption, but if you want to be resilient to that, you should be looking at ECC RAM, as subsequent checks can be affected by it too.

Second, you shouldn't tie an application or network protocol to the CRC scheme used by the filesystem on the server! Especially when other CRC algorithms can be used, not only CRC-32C.

If the checksum algorithm used by the FS were set in stone, then userspace could employ it somehow, but if different CRCs can be used, I see no reason to allow userspace to read them.

--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Quality Management System compliant with the ISO 9001:2000 standard
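The on-the-fly approach suggested in this reply might look like the sketch below. It is a minimal illustration, assuming the client-supplied checksum is a standard CRC-32 as computed by zlib; the function name write_with_checksum and its parameters are hypothetical.

```python
import zlib


def write_with_checksum(chunks, path, expected_crc):
    """Write chunks to path, updating a running CRC-32 as the data streams by."""
    crc = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            crc = zlib.crc32(chunk, crc)  # running CRC, no second pass over the data
            out.write(chunk)
    if crc != expected_crc:
        raise ValueError(
            f"checksum mismatch: got {crc:08X}, expected {expected_crc:08X}"
        )
    return crc
```

Note what this does and does not cover: it catches corruption that happened before the buffer was checksummed (for example, on the network), but not corruption between the checksum update and the data reaching the disk.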
On Thu, May 27, 2010 at 03:39:54PM +0200, Paul Millar wrote:
> [...]
> Yet another possibility would be to push the desired checksum value (via
> fcntl?) and have btrfs compare the desired checksum with the file's actual
> checksum on close(2), failing that call if the checksums don't match.
>
> Would any of this be possible (without an awful lot of work)?

I'd suggest that you look at T10 DIF and DIX, which are targeted at exactly this kind of thing. We're looking at integrating DIF/DIX into btrfs at some point.

> Second: adding support for Adler32?
>
> Looking at the unstable git repo, it looks like there's currently support for
> only the CRC-32C checksum algorithm. Is this correct? If so, is anyone
> working on adding support for Adler32?

We haven't looked at adler32. crc32c was chosen because it is supported in hardware by recent Intel CPUs.

-chris
Hi Hubert,

On Thursday 27 May 2010 16:56:00 Hubert Kario wrote:
> > Would [obtaining file checksum] be possible (without an awful lot
> > of work)?
>
> [Calculating checksum in-memory] won't detect in-memory corruption
> though, but if you want to be resilient to this, you should be looking at
> ECC RAM as subsequent checks can be affected by it too.

Certainly ECC RAM will help, but unfortunately it doesn't remove the possibility of corruption; for example, CERN found [1] that double-bit memory corruptions (which ECC cannot recover from) can still happen.

[1] http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797

Also, IIRC there was a case where Fermilab tracked down a data corruption to a faulty PCI bus in the server. So who knows all the places where corruption could occur?

I guess the real problem is that, when processing large amounts of data, these rare occurrences start to stack up.

> Second, you shouldn't tie application or network protocol to a CRC scheme
> used by filesystem on server! Especially when there can be other CRC
> algorithms used, not only CRC-32C.

Sure, but the protocol isn't tied to any particular checksum algorithm.

> If the checksum algorithm used by FS was set in stone, then userspace could
> employ it somehow, but if there can be different CRCs used, I see no reason
> to allow the userspace to read them.

I agree that a checksum value, without knowing the algorithm, isn't much use. However, suppose the FS reported a string representation of the tuple (algorithm, value); for example:

    0:DCD05C54

(where "0" is from BTRFS_CSUM_TYPE_CRC32)

Would that allow meaningful use of this information?

Cheers,

Paul.
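To make the proposed (algorithm, value) string concrete, here is a small sketch of how an application might format and parse it. Everything here is hypothetical: no such btrfs interface is confirmed by this thread, the helper names are invented, and the mapping of identifier 0 to "crc32c" simply mirrors the BTRFS_CSUM_TYPE_CRC32 example in the message above.

```python
# Hypothetical table of algorithm identifiers; 0 follows the message's example.
CSUM_ALGORITHMS = {0: "crc32c"}


def format_csum(alg_id, value):
    """Render an (algorithm id, 32-bit value) pair as 'id:HEXVALUE'."""
    return f"{alg_id}:{value:08X}"


def parse_csum(text):
    """Split 'id:HEXVALUE' into (algorithm name, integer value)."""
    alg_text, sep, hex_value = text.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in checksum string {text!r}")
    alg_id = int(alg_text)
    if alg_id not in CSUM_ALGORITHMS:
        # An unknown algorithm means the value cannot be compared meaningfully.
        raise ValueError(f"unknown checksum algorithm id {alg_id}")
    return CSUM_ALGORITHMS[alg_id], int(hex_value, 16)
```

An application would compare the parsed value with the client-supplied checksum only when the algorithm names agree, and fall back to its own verification otherwise.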
Hi Chris,

On Thursday 27 May 2010 18:00:44 Chris Mason wrote:
> I'd suggest that you look at T10 DIF and DIX, which are targeted at
> exactly this kind of thing. We're looking at integrating dif/dix into
> btrfs at some point.

I've been keeping half an eye on T10's work on ensuring end-to-end integrity. That you guys are planning to integrate DIF/DIX support is certainly welcome news!

In my use-case (a file server that receives a new file from a remote client), I believe that, to ensure end-to-end integrity, the server software would have to push the client-supplied checksum into the FS when writing a new file (I believe there are some T10 slides somewhere that show this use-case) -- or, equivalently, the server software obtains the FS checksum for the file and matches it against the client-supplied value.

I'm deliberately taking the simplest case, where the client has chosen the same checksum algorithm as the FS uses. In reality, this may not be the case, but we can probably cope with that.

My concern is that, if the server software doesn't push the client-provided checksum, then the FS checksum (plus T10 DIF/DIX) would not provide a rigorous assurance that the bytes are the same. Without this assurance, corruption could still occur undetected; for example, within the server's memory.

> We haven't looked at adler32. crc32c was chosen because it is supported
> in hardware by recent intel CPUs.

OK, fair enough :)

Cheers,

Paul.
On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar@desy.de> wrote:
> [...]
> My concern is that, if the server-software doesn't push the client-provided
> checksum then the FS checksum (plus T-10 DIF/DIX) would not provide a rigorous
> assurance that the bytes are the same. Without this assurance, corruption
> could still occur; for example, within the server's memory.

Have you taken into account the boundaries of the data checksums? Your app may checksum per file or per some logical partition in the file format. Btrfs does the checksum per extent, so unless you keep track of where the extent boundaries are, that checksum will be useless to the userspace app.

Also, the app would be tied specifically to one storage technology. No matter how great foo might be, not everyone's going to use it.

Also, are you going to get this info over nfs, cifs, lustre, gluster, ceph, foo, bar and baz?
>>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:

Paul> My concern is that, if the server-software doesn't push the
Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
Paul> would not provide a rigorous assurance that the bytes are the
Paul> same. Without this assurance, corruption could still occur; for
Paul> example, within the server's memory.

For DIX we allow integrity metadata (IMD) conversion. Once the data is received, the server generates appropriate IMD for the next layer. Then the server verifies that the original IMD matches the data buffer. That way there's no window of error. But obviously the ideal case is where the same IMD can be passed throughout the stack without conversion.

Not sure what you use for file service? I believe NFSv4 allows for checksums to be passed along. I have not looked at them closely yet, though.

--
Martin K. Petersen
Oracle Linux Engineering
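The ordering described here (generate the new integrity metadata first, verify the old one second) is what closes the window. A rough userspace analogy, not the DIX implementation itself, is sketched below; it assumes the incoming metadata is an Adler-32 value and the next layer wants CRC-32, and the function name convert_protection is invented for this example.

```python
import zlib


def convert_protection(buf, incoming_adler32):
    """Re-protect buf for the next layer, overlapping the two envelopes."""
    # Step 1: generate the next layer's integrity metadata first.
    outgoing_crc32 = zlib.crc32(buf)
    # Step 2: only then verify the incoming metadata against the same buffer.
    if zlib.adler32(buf) != incoming_adler32:
        raise ValueError("incoming integrity check failed")
    return outgoing_crc32
```

If the buffer changes before or between the two steps, step 2 fails. If it changes after step 2, the data no longer matches the CRC-32 computed in step 1, so the next layer's check fails. Verifying first and generating afterwards would leave a gap in which a change goes unnoticed by both checksums.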
Hi Mike,

On Monday 31 May 2010 22:33:23 Mike Fedyk wrote:
> On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar@desy.de> wrote:
> > [...] My concern is that, if the server-software doesn't push the
> > client-provided checksum then the FS checksum (plus T-10 DIF/DIX) would
> > not provide a rigorous assurance that the bytes are the same [...]
>
> Have you taken into account the boundaries of the data checksums?
> Your app may checksum per file or some logical partition in the file
> format.

I'm thinking specifically of the case when the user creates a file, writes the file's contents and closes it; for us, this is the only use-case when writing data. In this scenario, the checksum would be of the file's complete data rather than any particular logical partition.

> Btrfs does the checksum per-extent so unless you keep track
> of where the extent boundaries are, that checksum will be useless to
> the userspace app.

Sure, this is true with how things are currently. However, I was hoping that it would be possible to add code within btrfs to obtain the checksum over all the file's data. Since btrfs knows the extent sizes and per-extent checksum values, I believe this is tractable and relatively easy.

> Also the app would be tied specifically to a storage technology. No
> matter how great foo might be, not everyone's going to use it.

Sure, but I'm thinking of this behaviour (within the app) as being optional. The app would continue to be FS and storage-technology independent.

If the FS doesn't support internal consistency (e.g., ext3, xfs, ...) then the app would continue to do userland checksum verification on write: it's better than nothing.

If the app is deployed on a node with btrfs then the app could try to "align" the user-supplied checksum with the value within the FS: either pushing the correct checksum value into the FS or reading the resulting FS-generated checksum value after writing and comparing it with the user-supplied value.

> Also are you going to get this info over nfs, cifs, lustre, gluster,
> ceph, foo, bar and baz?

This is certainly a valid concern.

I can't speak for all these protocols and distributed filesystems: we don't support mounting our app with CIFS, and the software doesn't participate in Lustre, Gluster or Ceph cluster filesystems. However, here's information about the protocols we do support:

The majority of LAN transfers use a custom protocol. The wire-protocol includes support for uploading a checksum value on close.

We also support the xrootd protocol, which allows clients to upload checksum values with the kXR_verifyw command.

We also have support for NFS v4.1. NFS doesn't support uploading a checksum (I believe, and it isn't part of current v4.2 work), but we may be able to work around this.

We also support WebDAV. This currently has no support for checksums.

Almost all WAN transfers currently use GridFTP v2. This includes the SCKS command, which allows the client to upload the correct checksum value.

In short, with current usage, the app will know the checksum value, as supplied by the remote client.

Cheers,

Paul.
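The "userland checksum verification on write" fallback mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the client checksum is taken to be a zlib CRC-32, the function name write_and_verify is hypothetical, and the posix_fadvise call is merely a hint that asks the kernel to drop cached pages; it does not guarantee that the re-read comes from the device, which is the cache-related risk raised earlier in the thread.

```python
import os
import zlib


def write_and_verify(data, path, expected_crc, chunk_size=1 << 20):
    """Write data, flush it to storage, re-read it and compare CRC-32 values."""
    with open(path, "wb") as out:
        out.write(data)
        out.flush()
        os.fsync(out.fileno())  # ask the kernel to push the data to the device
        if hasattr(os, "posix_fadvise"):
            try:
                # Hint only: request that cached pages for this file be dropped.
                os.posix_fadvise(out.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # not supported here; the re-read may be served from cache
    crc = 0
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc == expected_crc
```

This costs a second pass over the data, which is the extra IO and CPU load that reading a filesystem-maintained checksum would avoid.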
On Tuesday 01 June 2010 15:39:52 Martin K. Petersen wrote:
> >>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:
> Paul> My concern is that, if the server-software doesn't push the
> Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
> Paul> would not provide a rigorous assurance that the bytes are the
> Paul> same. Without this assurance, corruption could still occur; for
> Paul> example, within the server's memory.
>
> For DIX we allow integrity metadata conversion. Once the data is
> received, the server generates appropriate IMD for the next layer. Then
> the server verifies that the original IMD matches the data buffer. That
> way there's no window of error. But obviously the ideal case is where
> the same IMD can be passed throughout the stack without conversion.

I think we may be talking slightly at cross-purposes here: in my case, one of the end-points (for "end-to-end data integrity") is a remote computer that is uploading a file with a corresponding checksum.

Please correct me if I'm wrong here, but T10 DIF/DIX refers only to data integrity protection from the OS's FS level down to the block device: a userland application doesn't know that it is writing into a FS that is utilising DIX with a DIF-enabled storage system.

When a file is uploaded from a remote client to an application with the checksum, the app can verify this checksum internally. However, there's then a (logical) gap between userland and FS where data integrity is no longer assured. For example, corruption that occurs after the app has verified the checksum value would not be picked up, even with T10 DIX/DIF, since the FS would receive and store the already-corrupted data "in good faith".

In principle, one can add a btrfs-specific mechanism to continue this assurance from userland down to the FS. Perhaps the simplest would be to allow userland applications to read the FS's internal checksum (the app would read the FS internal checksum after writing and verify that it is consistent), but I guess more sophisticated (interleaved IMD, T10-like) mechanisms are also possible.

Unfortunately, any such solution would be btrfs-specific, since (I believe) no one has standardised how to extend T10 into userspace.

> Not sure what you use for file service? I believe NFSv4 allows for
> checksums to be passed along. I have not looked at them closely yet,
> though.

I believe NFS currently doesn't support checksums (as per v4.1). Looking into more detail, Alok Aggarwal gave a talk at the 2006 Connectathon about this. Alok's slides have a nice diagram (slide 11) showing the kind of end-to-end integrity I'm after. The issue is how to achieve the assurance between "NFS Server" and "Local FS" on the right.

For NFS, I believe there aren't any plans for introducing checksum support for v4.2. Perhaps it'll appear with the later minor versions of the standard.

Cheers,

Paul.
On Monday 31 May 2010 19:59:46 Paul Millar wrote:
> [...]
> Certainly ECC RAM will help, but unfortunately it doesn't remove the
> possibility of corruption; for example, CERN found [1] that double-bit
> memory corruptions (which ECC cannot recover from) can still happen.
>
> [1] http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
>
> Also, IIRC there was a case where Fermilab tracked down a data corruption
> to a faulty PCI bus in the server. So who knows where are all the places
> corruption could occur?
>
> I guess the real problem is that, when processing large amounts of data,
> these rare occurrences start to stack up.

Yes, but AFAIK btrfs checksums are not themselves protected by a checksum (i.e., you can't check whether the checksum you read back is a valid one or not; it has no control bits). As such, if you consider PCI bus corruption likely, you still don't get 100% certainty that the data reached the HDD unharmed.

If you need that level of certainty when recording data, I'd consider mainframe hardware and/or duplicating the whole storage stack.

Cheers,
--
Hubert Kario
QBS - Quality Business Software
>>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:

Paul> Please correct me if I'm wrong here, but T10 DIF/DIX refers only
Paul> to data integrity protection from the OS's FS-level down to the
Paul> block device: a userland application doesn't know that it is
Paul> writing into a FS that is utilising DIX with a DIF-enabled storage
Paul> system.

My point was that it is possible to have different protection types in play (and thus different checksums) as long as you overlap the protection envelopes. At the expense of having to calculate checksums multiple times, of course.

Paul> Unfortunately, any such solution would be btrfs-specific, since (I
Paul> believe) no one has standardised how to extend T10 into userspace.

Not yet, but we're working on a generic interface that would allow the protection information to be attached. This is not going to be tied to just T10 DIF. The current Linux block layer integrity code handles different types of protection information.

Paul> I believe NFS currently doesn't support checksums (as per v4.1).
Paul> Looking into more detail, Alok Aggarwal gave a talk at 2006
Paul> connectathon about this. Alok's slides have a nice diagram (slide
Paul> 11) showing the kind of end-to-end integrity I'm after. The issue
Paul> is how to achieve the assurance between "NFS Server" and "Local
Paul> FS" on the right.

Paul> For NFS, I believe there aren't any plans for introducing checksum
Paul> support for v4.2. Perhaps it'll appear with the later minor
Paul> versions of the standard.

I haven't looked into this for a long time. Last time I talked to the NFS folks they seemed to think it would be possible to bridge the two methods.

--
Martin K. Petersen
Oracle Linux Engineering