thr3ads.net - Btrfs devel - A couple of questions [May 2010]

Paul Millar

2010-May-27 13:39 UTC

A couple of questions

Hi,

I've been looking at Btrfs and have a couple of naive questions that don't
seem to be answered on the wiki or in the articles I've read on the
filesystem.


First: discovering a file's checksum value.

Here's the scenario: software is writing some data as a fresh file.  This
software happens to know (a priori) the checksum of this data; for example, a
storage server receives the file's data and checksum independently.

I've some confidence that, once the data is stored in btrfs, any corruption
(from the storage fabric) will be spotted; however, the data may have become
corrupt before being stored (e.g., from the network).  To catch this, the
checksum of the stored data needs to be calculated and checked.

One approach is to calculate the checksum (in user-space) after the data is
stored.  This adds extra IO- and CPU-load and there's also the possibility of
false-negative results due to the filesystem cache (although btrfs may remove
this risk).

Another approach would be to ask btrfs for the checksum.  It seems that it's
possible to combine multiple CRC-32C values to figure out the checksum of the
combined data [e.g., zlib's crc32_combine() function].  So, obtaining a file's
checksum might be a light-weight operation.
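
To illustrate the combining idea, here is a minimal Python sketch of the
approach that zlib's crc32_combine() takes.  It rests on some assumptions: it
uses the standard CRC-32 polynomial from Python's zlib module as a stand-in
(btrfs uses CRC-32C, which has a different polynomial), and the helper names
are made up for this example; none of this is btrfs code.

```python
import zlib

CRC32_POLY = 0xEDB88320  # reflected CRC-32 polynomial (zlib's, not CRC-32C)


def gf2_matrix_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (list of 32 column words) by a vector."""
    total = 0
    col = 0
    while vec:
        if vec & 1:
            total ^= mat[col]
        vec >>= 1
        col += 1
    return total


def gf2_matrix_square(mat):
    """Return mat * mat over GF(2)."""
    return [gf2_matrix_times(mat, column) for column in mat]


def crc32_combine(crc_a, crc_b, len_b):
    """CRC-32 of A+B from crc32(A), crc32(B) and len(B), without the data."""
    if len_b <= 0:
        return crc_a
    # Operator that advances a CRC register over a single zero bit.
    op = [CRC32_POLY] + [1 << n for n in range(31)]
    # Square three times: 1 bit -> 2 -> 4 -> 8 bits, i.e. one zero byte.
    for _ in range(3):
        op = gf2_matrix_square(op)
    # Apply the operator for 2**k zero bytes for every set bit k of len_b.
    while len_b:
        if len_b & 1:
            crc_a = gf2_matrix_times(op, crc_a)
        op = gf2_matrix_square(op)
        len_b >>= 1
    return crc_a ^ crc_b
```

The cost grows with the logarithm of the second block's length rather than
with the amount of data, which is why folding per-block values into a
whole-file value could plausibly be cheap.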

Yet another possibility would be to push the desired checksum value (via
fcntl?) and have btrfs compare the desired checksum with the file's actual
checksum on close(2), failing that call if the checksums don't match.

Would any of this be possible (without an awful lot of work)?



Second: adding support for Adler32?

Looking at the unstable git repo, it looks like there's currently support for
only the CRC-32C checksum algorithm.  Is this correct?  If so, is anyone
working on adding support for Adler32?

Cheers,

Paul.
(ps, please keep me CC-ed in on replies)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hubert Kario

2010-May-27 14:56 UTC

Re: A couple of questions

On Thursday 27 May 2010 15:39:54 Paul Millar wrote:
> Hi,
>
> I've been looking at Btrfs and have a couple of naive questions that don't
> seem to be answered on the wiki or in the articles I've read on the
> filesystem.
>
>
> First: discovering a file's checksum value.
>
> Here's the scenario: software is writing some data as a fresh file.  This
> software happens to know (a priori) the checksum of this data; for example,
> a storage server receives the file's data and checksum independently.
>
> I've some confidence that, once the data is stored in btrfs, any corruption
> (from the storage fabric) will be spotted; however, the data may have
> become corrupt before being stored (e.g., from the network).  To catch
> this, the checksum of the stored data needs to be calculated and checked.
>
> One approach is to calculate the checksum (in user-space) after the data is
> stored.  This adds extra IO- and CPU-load and there's also the possibility
> of false-negative results due to the filesystem cache (although btrfs may
> remove this risk).
>
> Another approach would be to ask btrfs for the checksum.  It seems that
> it's possible to combine multiple CRC-32C values to figure out the
> checksum of the combined data [e.g., zlib's crc32_combine() function].
> So, obtaining a file's checksum might be a light-weight operation.
>
> Yet another possibility would be to push the desired checksum value (via
> fcntl?) and have btrfs compare the desired checksum with the file's actual
> checksum on close(2), failing that call if the checksums don't match.
>
> Would any of this be possible (without an awful lot of work)?

IMO, if an application receives data with a checksum, it can calculate the
checksum of the data on the fly, as it writes it to disk. That won't add any
additional IO to the storage subsystem. It won't detect in-memory corruption,
though; if you want to be resilient to that, you should be looking at ECC
RAM, as subsequent checks can be affected by it too.
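
A minimal Python sketch of checksumming on the fly while writing might look
like the following.  The function name, the chunked-iterable input and the
choice of Adler-32 are assumptions made for illustration only; any algorithm
with a running (incremental) form would work the same way.

```python
import zlib


def write_with_checksum(chunks, path):
    """Write an iterable of byte chunks to `path`.

    Returns the Adler-32 of everything written, updated chunk by chunk so
    the data never has to be read back from storage.
    """
    checksum = 1  # zlib's starting value for Adler-32
    with open(path, "wb") as out:
        for chunk in chunks:
            checksum = zlib.adler32(chunk, checksum)
            out.write(chunk)
    return checksum
```

The caller would compare the returned value with the checksum supplied by
the client and reject the upload on a mismatch.  As noted, this only covers
the data as it sat in the application's buffers, not what ends up on disk.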

Second, you shouldn't tie an application or network protocol to the CRC
scheme used by the filesystem on the server! Especially when other CRC
algorithms can be used, not only CRC-32C.

If the checksum algorithm used by the FS were set in stone, then userspace
could employ it somehow, but if different CRCs can be used, I see no reason to
allow userspace to read them.


-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

System Zarządzania Jakością
zgodny z normą ISO 9001:2000

Chris Mason

2010-May-27 16:00 UTC

Re: A couple of questions

On Thu, May 27, 2010 at 03:39:54PM +0200, Paul Millar wrote:
> Hi,
>
> I've been looking at Btrfs and have a couple of naive questions that don't
> seem to be answered on the wiki or in the articles I've read on the
> filesystem.
>
>
> First: discovering a file's checksum value.
>
> Here's the scenario: software is writing some data as a fresh file.  This
> software happens to know (a priori) the checksum of this data; for example,
> a storage server receives the file's data and checksum independently.
>
> I've some confidence that, once the data is stored in btrfs, any corruption
> (from the storage fabric) will be spotted; however, the data may have
> become corrupt before being stored (e.g., from the network).  To catch
> this, the checksum of the stored data needs to be calculated and checked.
>
> One approach is to calculate the checksum (in user-space) after the data is
> stored.  This adds extra IO- and CPU-load and there's also the possibility
> of false-negative results due to the filesystem cache (although btrfs may
> remove this risk).
>
> Another approach would be to ask btrfs for the checksum.  It seems that
> it's possible to combine multiple CRC-32C values to figure out the
> checksum of the combined data [e.g., zlib's crc32_combine() function].
> So, obtaining a file's checksum might be a light-weight operation.
>
> Yet another possibility would be to push the desired checksum value (via
> fcntl?) and have btrfs compare the desired checksum with the file's actual
> checksum on close(2), failing that call if the checksums don't match.
>
> Would any of this be possible (without an awful lot of work)?

I'd suggest that you look at T10 DIF and DIX, which are targeted at
exactly this kind of thing.  We're looking at integrating dif/dix into
btrfs at some point.

> Second: adding support for Adler32?
>
> Looking at the unstable git repo, it looks like there's currently support
> for only the CRC-32C checksum algorithm.  Is this correct?  If so, is
> anyone working on adding support for Adler32?

We haven't looked at adler32.  crc32c was chosen because it is supported
in hardware by recent Intel CPUs.

-chris

Paul Millar

2010-May-31 17:59 UTC

Re: A couple of questions

Hi Hubert,

On Thursday 27 May 2010 16:56:00 Hubert Kario wrote:
> > Would [obtaining file checksum] be possible (without an awful lot
> > of work)?
>
> [Calculating checksum in-memory] won't detect in-memory corruption
> though, but if you want to be resilient to this, you should be looking at
> ECC RAM as subsequent checks can be affected by it too.

Certainly ECC RAM will help, but unfortunately it doesn't remove the
possibility of corruption; for example, CERN found [1] that double-bit memory
corruptions (which ECC cannot recover from) can still happen.

[1] 
http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797

Also, IIRC there was a case where Fermilab tracked down a data corruption to a
faulty PCI bus in the server.  So who knows all the places where corruption
could occur?

I guess the real problem is that, when processing large amounts of data, these 
rare occurrences start to stack up.

> Second, you shouldn't tie application or network protocol to a CRC scheme
> used by filesystem on server! Especially when there can be other CRC
> algorithms used, not only CRC-32C.

Sure, but the protocol isn't tied to any particular checksum algorithm.

> If the checksum algorithm used by FS was set in stone, then userspace could
> employ it somehow, but if there can be different CRCs used, I see no reason
> to allow the userspace to read them.

I agree that a checksum value, without knowing the algorithm, isn't much use.
However, suppose the FS reported a string representation of the tuple
(algorithm, value); for example:

   0:DCD05C54

(where "0" is from BTRFS_CSUM_TYPE_CRC32)

Would that allow meaningful use of this information?
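
As a rough sketch of how an application might consume such a string, the
following Python parses the "algorithm:value" form shown above and refuses
to compare values from different algorithms.  The function names and the
three-way result are hypothetical; no such interface exists in btrfs today.

```python
def parse_checksum(text):
    """Split 'ALGO:HEXVALUE' into (algorithm id, integer checksum value)."""
    algo, sep, value = text.strip().partition(":")
    if not sep:
        raise ValueError("expected 'algorithm:value', got %r" % text)
    return int(algo), int(value, 16)


def checksum_matches(reported, expected_algo, expected_value):
    """Compare an FS-reported checksum string with a client-supplied one.

    Returns True or False when the algorithms agree, and None when they
    differ (the values are then not comparable, so the caller would fall
    back to computing the checksum itself).
    """
    algo, value = parse_checksum(reported)
    if algo != expected_algo:
        return None
    return value == expected_value
```

With the example above, checksum_matches("0:DCD05C54", 0, 0xDCD05C54) would
report a match, while a client that supplied a value for some other
algorithm id would get None and carry on with userland verification.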

Cheers,

Paul.

Paul Millar

2010-May-31 18:06 UTC

Re: A couple of questions

Hi Chris,

On Thursday 27 May 2010 18:00:44 Chris Mason wrote:
> I'd suggest that you look at T10 DIF and DIX, which are targeted at
> exactly this kind of thing.  We're looking at integrating dif/dix into
> btrfs at some point.

I've been keeping half-an-eye on T10's work in ensuring end-to-end integrity.
That you guys are planning to integrate dif/dix support is certainly welcome
news!

In my use-case (a file-server that receives a new file from a remote client),
I believe that, to ensure end-to-end integrity, the server software would
have to push the client-supplied checksum into the FS when writing a new file
(I believe there's some T10 slides somewhere that show this use-case) -- or
(equivalently) the server software obtains the FS checksum for the file and
matches it against the client-supplied value.

I'm deliberately taking the simplest case when the client has chosen the same
checksum algorithm as the FS uses.  In reality, this may not be the case, but
we can probably cope with that.

My concern is that, if the server-software doesn't push the client-provided
checksum then the FS checksum (plus T-10 DIF/DIX) would not provide a rigorous
assurance that the bytes are the same.  Without this assurance, corruption
could still occur; for example, within the server's memory.

> We haven't looked at adler32.  crc32c was chosen because it is supported
> in hardware by recent intel CPUs.
OK, fair enough :)

Cheers,

Paul.

Mike Fedyk

2010-May-31 20:33 UTC

Re: A couple of questions

On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar@desy.de> wrote:
> Hi Chris,
>
> On Thursday 27 May 2010 18:00:44 Chris Mason wrote:
>> I'd suggest that you look at T10 DIF and DIX, which are targeted at
>> exactly this kind of thing.  We're looking at integrating dif/dix into
>> btrfs at some point.
>
> I've been keeping half-an-eye on T10's work in ensuring end-to-end
> integrity.  That you guys are planning to integrate dif/dix support is
> certainly welcome news!
>
> In my use-case (a file-server that receives a new file from a remote
> client), I believe that, to ensure end-to-end integrity, the server
> software would have to push the client-supplied checksum into the FS when
> writing a new file (I believe there's some T10 slides somewhere that show
> this use-case) -- or (equivalently) the server software obtains the FS
> checksum for the file and matches it against the client-supplied value.
>
> I'm deliberately taking the simplest case when the client has chosen the
> same checksum algorithm as the FS uses.  In reality, this may not be the
> case, but we can probably cope with that.
>
> My concern is that, if the server-software doesn't push the
> client-provided checksum then the FS checksum (plus T-10 DIF/DIX) would
> not provide a rigorous assurance that the bytes are the same.  Without
> this assurance, corruption could still occur; for example, within the
> server's memory.

Have you taken into account the boundaries of the data checksums?
Your app may checksum per file or some logical partition in the file
format.  Btrfs does the checksum per-extent so unless you keep track
of where the extent boundaries are, that checksum will be useless to
the userspace app.  Also the app would be tied specifically to a
storage technology.  No matter how great foo might be, not everyone's
going to use it.

Also are you going to get this info over nfs, cifs, lustre, gluster,
ceph, foo, bar and baz?

Martin K. Petersen

2010-Jun-01 13:39 UTC

Re: A couple of questions

>>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:

Paul> My concern is that, if the server-software doesn't push the
Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
Paul> would not provide a rigorous assurance that the bytes are the
Paul> same.  Without this assurance, corruption could still occur; for
Paul> example, within the server's memory.

For DIX we allow integrity metadata conversion.  Once the data is
received, the server generates appropriate IMD for the next layer.  Then
the server verifies that the original IMD matches the data buffer.  That
way there's no window of error.  But obviously the ideal case is where
the same IMD can be passed throughout the stack without conversion.
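
The ordering described above can be sketched in a few lines of Python.  This
is only an illustration under assumptions: the function name is made up,
CRC-32 stands in for the incoming integrity metadata and Adler-32 for the
next layer's, and real DIX works on per-sector protection information rather
than whole buffers.

```python
import zlib


def convert_integrity(buffer, incoming_crc32):
    """Convert integrity metadata for `buffer` without leaving a gap.

    The new checksum for the next layer is generated first; only then is
    the original checksum verified against the same buffer.  Any change to
    the buffer between the two steps makes the old check fail, so the two
    protection envelopes overlap.
    """
    outgoing_adler32 = zlib.adler32(buffer)    # 1. generate IMD for next layer
    if zlib.crc32(buffer) != incoming_crc32:   # 2. then verify the original IMD
        raise ValueError("incoming checksum does not match the data buffer")
    return outgoing_adler32
```

Doing the two steps in the opposite order (verify, then generate) would leave
a window in which the buffer could change after verification and before the
new checksum is computed, and the new checksum would then cover bad data.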

Not sure what you use for file service?  I believe NFSv4 allows for
checksums to be passed along. I have not looked at them closely yet,
though.

-- 
Martin K. Petersen	Oracle Linux Engineering

Paul Millar

2010-Jun-02 11:56 UTC

Re: A couple of questions

Hi Mike,

On Monday 31 May 2010 22:33:23 Mike Fedyk wrote:
> On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar@desy.de> wrote:
> > [...] My concern is that, if the server-software doesn't push the
> > client-provided checksum then the FS checksum (plus T-10 DIF/DIX) would
> > not provide a rigorous assurance that the bytes are the same [...]
>
> Have you taken into account the boundaries of the data checksums?
> Your app may checksum per file or some logical partition in the file
> format.

I'm thinking specifically of the case when the user creates a file, writes the
file's contents and closes it; for us, this is the only use-case when writing
data.  In this scenario, the checksum would be of the file's complete data
rather than any particular logical partition.

> Btrfs does the checksum per-extent so unless you keep track
> of where the extent boundaries are, that checksum will be useless to
> the userspace app.

Sure, this is true with how things are currently.

However, I was hoping that it would be possible to add code within btrfs to
obtain the checksum over all of the file's data.  Since btrfs knows the
extent sizes and per-extent checksum values, I believe this is tractable and
relatively easy.

> Also the app would be tied specifically to a storage technology.  No
> matter how great foo might be, not everyone's going to use it.

Sure, but I'm thinking of this behaviour (within the app) as being optional.
The app would continue to be FS and storage-technology independent.

If the FS doesn't support internal consistency (e.g., ext3, xfs, ..) then the
app would continue to do userland checksum verification on write: it's better
than nothing.

If the app is deployed on a node with btrfs then the app could try to "align"
the user-supplied checksum with the value within the FS: either pushing the
correct checksum value into the FS or reading the resulting FS-generated
checksum value after writing and comparing it with the user-supplied value.

> Also are you going to get this info over nfs, cifs, lustre, gluster,
> ceph, foo, bar and baz?

This is certainly a valid concern.

I can't speak for all these protocols and distributed filesystems: we don't
support mounting our app with CIFS and the software doesn't participate with
lustre, gluster, ceph cluster filesystems.

However, here's information about the protocols we do support:

The majority of LAN transfers use a custom protocol.  The wire-protocol 
includes support for uploading a checksum value on close.

We also support the xrootd protocol, which allows clients to upload checksum 
values with the kXR_verifyw command.

We also support NFS v4.1.  NFS doesn't support uploading a checksum (I
believe, and it isn't part of the current v4.2 work), but we may be able to
work around this.

We also support WebDAV.  This currently has no support for checksum.

Almost all WAN transfers currently use GridFTP v2.  This includes the SCKS 
command, which allows the client to upload the correct checksum value.

In short, with current usage, the app will know the checksum value, as 
supplied by the remote client.

Cheers,

Paul.


Paul Millar

2010-Jun-02 13:40 UTC

Re: A couple of questions

On Tuesday 01 June 2010 15:39:52 Martin K. Petersen wrote:
> >>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:
> Paul> My concern is that, if the server-software doesn't push the
> Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
> Paul> would not provide a rigorous assurance that the bytes are the
> Paul> same.  Without this assurance, corruption could still occur; for
> Paul> example, within the server's memory.
>
> For DIX we allow integrity metadata conversion.  Once the data is
> received, the server generates appropriate IMD for the next layer.  Then
> the server verifies that the original IMD matches the data buffer.  That
> way there's no window of error.  But obviously the ideal case is where
> the same IMD can be passed throughout the stack without conversion.

I think we may be talking slightly at cross-purposes here: in my case, one of
the end-points (for "end-to-end data integrity") is a remote computer, that is
uploading a file with a corresponding checksum.

Please correct me if I'm wrong here, but T10 DIF/DIX refers only to data
integrity protection from the OS's FS-level down to the block device: a
userland application doesn't know that it is writing into a FS that is
utilising DIX with a DIF-enabled storage system.

When a file is uploaded from a remote client to an application with the
checksum, the app can verify this checksum internally.  However, there's then
a (logical) gap between userland and FS where data integrity is no longer
assured.  For example, corruption that occurs after the app has verified the
checksum value would not be picked up, even with T10 DIX/DIF, since the FS
would receive and store the already-corrupted data "in good faith".

In principle, one can add a btrfs-specific mechanism to continue this
assurance from userland down to the FS.  Perhaps the simplest would be to
allow userland applications to read the FS's internal checksum (app would read
the FS internal checksum after writing and verify it is consistent), but I
guess more sophisticated (interleaved IMD, T10-like) mechanisms are also
possible.

Unfortunately, any such solution would be btrfs-specific, since (I believe) no 
one has standardised how to extend T10 into userspace.

> Not sure what you use for file service?  I believe NFSv4 allows for
> checksums to be passed along. I have not looked at them closely yet,
> though.

I believe NFS currently doesn't support checksums (as per v4.1).  Looking into
more detail, Alok Aggarwal gave a talk at 2006 connectathon about this.
Alok's slides have a nice diagram (slide 11) showing the kind of end-to-end
integrity I'm after.  The issue is how to achieve the assurance between "NFS
Server" and "Local FS" on the right.

For NFS, I believe there aren't any plans for introducing checksum support for
v4.2.  Perhaps it'll appear with the later minor versions of the standard.

Cheers,

Paul.

Hubert Kario

2010-Jun-02 16:19 UTC

Re: A couple of questions

On Monday 31 May 2010 19:59:46 Paul Millar wrote:
> Hi Hubert,
>
> On Thursday 27 May 2010 16:56:00 Hubert Kario wrote:
> > > Would [obtaining file checksum] be possible (without an awful lot
> > > of work)?
> >
> > [Calculating checksum in-memory] won't detect in-memory corruption
> > though, but if you want to be resilient to this, you should be looking
> > at ECC RAM as subsequent checks can be affected by it too.
>
> Certainly ECC RAM will help, but unfortunately it doesn't remove the
> possibility of corruption; for example, CERN found [1] that double-bit
> memory corruptions (which ECC cannot recover from) can still happen.
>
> [1]
> http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
>
> Also, IIRC there was a case where Fermilab tracked down a data corruption
> to a faulty PCI bus in the server.  So who knows all the places where
> corruption could occur?
>
> I guess the real problem is that, when processing large amounts of data,
> these rare occurrences start to stack up.

Yes, but AFAIK btrfs checksums don't have an internal checksum of their own
(i.e., you can't check whether a checksum you read back is itself valid; it
has no control bits).  As such, if you consider PCI bus corruption likely, you
still don't get 100% certainty that the data reached the HDD unharmed.

If you need that level of certainty when recording data, I'd consider
mainframe hardware and/or duplicating the whole storage stack.

Cheers,
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

System Zarządzania Jakością
zgodny z normą ISO 9001:2000

Martin K. Petersen

2010-Jun-04 01:17 UTC

Re: A couple of questions

>>>>> "Paul" == Paul Millar <paul.millar@desy.de> writes:

Paul> Please correct me if I'm wrong here, but T10 DIF/DIX refers only
Paul> to data integrity protection from the OS's FS-level down to the
Paul> block device: a userland application doesn't know that it is
Paul> writing into a FS that is utilising DIX with a DIF-enabled storage
Paul> system.

My point was that it is possible to have different protection types in
play (and thus different checksums) as long as you overlap the
protection envelopes.  At the expense of having to calculate checksums
multiple times, of course.


Paul> Unfortunately, any such solution would be btrfs-specific, since (I
Paul> believe) no one has standardised how to extend T10 into userspace.

Not yet, but we're working on a generic interface that would allow the
protection information to be attached.  This is not going to be tied to
just T10 DIF.  The current Linux block layer integrity handles different
types of protection information.


Paul> I believe NFS currently doesn't support checksums (as per v4.1).
Paul> Looking into more detail, Alok Aggarwal gave a talk at 2006
Paul> connectathon about this.  Alok's slides have a nice diagram (slide
Paul> 11) showing the kind of end-to-end integrity I'm after.  The issue
Paul> is how to achieve the assurance between "NFS Server" and "Local
Paul> FS" on the right.

Paul> For NFS, I believe there aren't any plans for introducing checksum
Paul> support for v4.2.  Perhaps it'll appear with the later minor
Paul> versions of the standard.

I haven't looked into this for a long time.  Last time I talked to the
NFS folks they seemed to think it would be possible to bridge the two
methods.

-- 
Martin K. Petersen	Oracle Linux Engineering