rsivak at istandfor.com
2007-Dec-06 01:18 UTC
[CentOS] Filesystem that doesn't store duplicate data
Is there such a filesystem available? It seems like it wouldn't be too hard to implement... Basically, do things on a block-by-block basis: store the md5 of each block in a table, and when writing a new block, check whether that md5 already exists; if it does, point the new block at the old block. Since md5 is not guaranteed unique, you might need to do a diff between the 2 blocks, and if the blocks are indeed different, handle it somehow.

When modifying an existing block that has multiple pointers, copy the block and modify the new copy.

I know I'm oversimplifying things a lot, but something like this could work, no? It would be a great filesystem to store backups on, or things like vmware volumes...

Russ
Sent from my Verizon Wireless BlackBerry
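[A minimal sketch of the scheme Russ describes, in Python. The class and method names, the in-memory dictionaries, and the 4 KiB block size are all invented for illustration; a real filesystem would keep these structures on disk and handle crash consistency.]

import hashlib

BLOCK_SIZE = 4096

class DedupStore:
    def __init__(self):
        self.blocks = {}      # block_id -> raw block data
        self.by_hash = {}     # md5 digest -> block_ids with that digest
        self.refcount = {}    # block_id -> number of pointers to it
        self.next_id = 0

    def write_block(self, data):
        """Store a block, reusing an identical existing block if possible."""
        digest = hashlib.md5(data).digest()
        # md5 is not guaranteed unique, so byte-compare every candidate
        # before sharing it (the "diff between the 2 blocks" step).
        for block_id in self.by_hash.get(digest, []):
            if self.blocks[block_id] == data:
                self.refcount[block_id] += 1
                return block_id
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.by_hash.setdefault(digest, []).append(block_id)
        self.refcount[block_id] = 1
        return block_id

    def modify_block(self, block_id, new_data):
        """Copy-on-write: a block with multiple pointers is never
        modified in place."""
        if self.refcount[block_id] > 1:
            self.refcount[block_id] -= 1
            return self.write_block(new_data)   # private copy for this writer
        # sole owner: drop the old block and rewrite
        old_digest = hashlib.md5(self.blocks[block_id]).digest()
        self.by_hash[old_digest].remove(block_id)
        del self.blocks[block_id], self.refcount[block_id]
        return self.write_block(new_data)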
redhat at mckerrs.net
2007-Dec-06 01:50 UTC
[CentOS] Filesystem that doesn't store duplicate data
----- Original Message -----
From: rsivak at istandfor.com
To: "CentOS Mailing list" <centos at centos.org>
Sent: Thursday, December 6, 2007 11:18:16 AM (GMT+1000) Australia/Brisbane
Subject: [CentOS] Filesystem that doesn't store duplicate data

> Is there such a filesystem available? It seems like it wouldn't be too
> hard to implement... [...] It would be a great filesystem to store
> backups on, or things like vmware volumes...

You are describing what I understand to be 'data de-duplication'. It is all the rage for backups, as it has the potential to decrease backup times and volumes by significant amounts. I went to a presentation by Avamar (a partner of EMC?) regarding this technology, and it seemed really nice for your typical Windows file server. I suppose it effectively turns your data into 'single-instance' storage, which is no bad thing. I suppose it could be useful for large database backups as well.

You'd think that using this technology on a live filesystem could incur a significant performance penalty due to all those calculations (FUSE module, anyone?). Imagine a hardware-optimized data de-duplication disk controller, similar to RAID XOR-optimized CPUs. Now that would be cool. All it would need to store is metadata when it had already seen the exact same block. I think it is fundamentally similar in result to on-the-fly disk compression.

Let us know when you have a beta to test!

8^)
Peter Arremann
2007-Dec-06 01:55 UTC
[CentOS] Filesystem that doesn't store duplicate data
On Wednesday 05 December 2007, redhat at mckerrs.net wrote:
> You'd think that using this technology on a live filesystem could incur
> a significant performance penalty due to all those calculations (FUSE
> module, anyone?). Imagine a hardware-optimized data de-duplication disk
> controller, similar to RAID XOR-optimized CPUs. Now that would be cool.
> All it would need to store is metadata when it had already seen the
> exact same block. I think it is fundamentally similar in result to
> on-the-fly disk compression.

Actually, the impact - if the filesystem is designed correctly - shouldn't be that horrible. After all, Sun has managed to integrate checksums into ZFS and still get great performance. In addition, ZFS doesn't directly overwrite data but uses a new datablock each time...

What you would have to do then is keep a lookup table with the checksums to find possible matches quickly. When you find one, do another compare to be 100% sure you didn't have a collision on your checksums. If that works, you can reference that datablock.

It is still a lot of work, but as Sun showed, on-the-fly compares and checksums are doable without too much of a hit.

Peter.
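[A quick back-of-envelope test of the "all those calculations" worry: the standalone Python snippet below times raw md5 throughput over small blocks, which is the per-write cost the scheme adds. The 4 KiB block size and ~100 MiB working set are arbitrary choices for illustration.]

import hashlib
import os
import time

BLOCK_SIZE = 4096
N_BLOCKS = 25_000                       # ~100 MiB of random data

blocks = [os.urandom(BLOCK_SIZE) for _ in range(N_BLOCKS)]

start = time.perf_counter()
for b in blocks:
    hashlib.md5(b).digest()             # the per-block work a dedup FS adds
elapsed = time.perf_counter() - start

mb = N_BLOCKS * BLOCK_SIZE / 2**20
print(f"hashed {mb:.0f} MiB in {elapsed:.2f}s ({mb / elapsed:.0f} MiB/s)")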
redhat at mckerrs.net wrote:
> [...] Imagine a hardware-optimized data de-duplication disk controller,
> similar to RAID XOR-optimized CPUs. Now that would be cool. All it would
> need to store is metadata when it had already seen the exact same block.

I'm not sure if this would be possible to make available on a disk controller, as I don't think a controller can store the amount of data necessary to hold the hashes.

I am thinking of maybe making it as a FUSE module. I'm most familiar with Java, and there are FUSE bindings for Java. I would love to make at least a proof-of-concept FS that does this.

Does FUSE exist for Windows? How does one test a FUSE module while developing it?

Russ
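[For what it's worth, a mountable toy filesystem takes very little code. This sketch uses the Python fusepy bindings rather than Java - an assumption made purely to keep the example short; the Java bindings follow the same callback model. It serves a single read-only file.]

import errno
import stat
import sys
from fuse import FUSE, FuseOSError, Operations   # pip install fusepy

HELLO = b"hello from userspace\n"

class HelloFS(Operations):
    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444,
                        st_nlink=1, st_size=len(HELLO))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return HELLO[offset:offset + size]

if __name__ == "__main__":
    # foreground=True keeps the process attached to the terminal, so
    # print() debugging (or a debugger) works while the FS is mounted.
    FUSE(HelloFS(), sys.argv[1], foreground=True)

[As for testing: run it in the foreground in one terminal, exercise the mountpoint with ls/cat/cp from another, and unmount with fusermount -u <mountpoint>.]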
Ross S. W. Walker
2007-Dec-06 14:24 UTC
[CentOS] Filesystem that doesn't store duplicate data
These are all good and valid issues.

Thinking about it some more, I might just implement it as a system service that scans given disk volumes in the background and keeps a hidden directory where it stores its state information and hardlinks named after the md5 hash of the files on the volume. If a collision occurs with an existing md5 hash, the new file is unlinked and re-linked to the md5 hash file; if an md5 hash file exists with no secondary links, it is removed. Maybe monitor the journal or use inotify to catch new files, and once a week do a full volume scan.

This way the filesystem performs as well as it normally does, and as things go forward, duplicate files are eliminated (combined). Of course the problem arises of what to do when one duplicate is modified but the other should remain the same...

And as you said, revisions that differ just a little won't take advantage of this; it's file level, so it only works with whole files. Still better than nothing.

-Ross

-----Original Message-----
From: centos-bounces at centos.org <centos-bounces at centos.org>
To: CentOS mailing list <centos at centos.org>
Sent: Thu Dec 06 08:10:38 2007
Subject: Re: [CentOS] Filesystem that doesn't store duplicate data

On Thursday 06 December 2007, Ross S. W. Walker wrote:
> How about a FUSE file system (userland, i.e. NTFS-3G) that layers
> on top of any file system that supports hard links

That would be easy, but I can see a few issues with that approach:

1) On the file level rather than the block level you're going to be much more inefficient. I for one have gigabytes of revisions of files that have changed a little between each file.

2) You have to write all datablocks to disk and then erase them again if you find a match. That will slow you down and create some weird behavior. I.e., you know the FS shouldn't store duplicate data, yet you can't use cp to copy a 10G file if only 9G are free. If you copy an 8G file, you see the usage increase till only 1G is free; then when your app closes the file, you go back to 9G free...

3) Rather than continuously looking for matches on the block level, you have to search for matches on files that can be any size. That is fine if you have a 100K file - but if you have a 100M or larger file, the checksum calculations will take you forever. This means rather than adding a specific, small penalty to every write call, you add an unknown penalty, proportional to file size, when closing the file. Also, the fact that most C coders don't check the return code of close doesn't make me happy there...

Peter.
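[A rough Python sketch of the background service Ross describes, assuming a hidden ".dedup" state directory on each scanned volume - the directory name and function names here are made up. It hashes every regular file, hardlinks duplicates to a canonical copy named after the md5 hash, and prunes hash files with no remaining secondary links. Note it trusts md5 alone, as in the description; a paranoid version would byte-compare before re-linking. Hardlinks require everything to live on one filesystem.]

import hashlib
import os

STORE = ".dedup"

def file_md5(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def dedup_volume(root):
    store = os.path.join(root, STORE)
    os.makedirs(store, exist_ok=True)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != STORE]  # skip state dir
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            canonical = os.path.join(store, file_md5(path))
            if not os.path.exists(canonical):
                os.link(path, canonical)        # first copy seen: remember it
            elif not os.path.samefile(path, canonical):
                tmp = path + ".dedup-tmp"
                os.link(canonical, tmp)         # unlink the duplicate and
                os.replace(tmp, path)           # re-link it to the hash file
    for name in os.listdir(store):              # prune hash files that have
        p = os.path.join(store, name)           # no secondary links left
        if os.stat(p).st_nlink == 1:
            os.remove(p)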
This is a bit different than what I was proposing. I know that BackupPC already does this at the file level, but I want a filesystem that does it at the block level. File level only helps if you're backing up multiple systems and they all have exactly the same files. Block level would help a lot more, I think. You'd be able to do a full backup every night and have it take up only around the same space as a differential backup. Things like virtual machine disk images, which are often clones of each other, could take up only a small additional amount of space for each clone, proportional to the changes made to that disk image.

Nobody really answered this, so I'll ask again. Is there a Windows version of FUSE? How does one test a FUSE filesystem while developing it? It would be nice to just be able to run something from Eclipse once you've made your changes and have a drive mounted and ready to test. Being able to debug a filesystem while it's running would be great too. Anyone here with experience building FUSE filesystems?

Russ

Ross S. W. Walker wrote:
> These are all good and valid issues.
>
> Thinking about it some more, I might just implement it as a system
> service that scans given disk volumes in the background... [...] it's
> file level, so it only works with whole files. Still better than
> nothing.
>
> -Ross
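[One way to put a number on the VM-image claim before building anything: hash both images block by block and count the overlap. A standalone Python sketch; the 4 KiB block size is arbitrary, and fixed-offset blocking understates sharing when data shifts by a few bytes.]

import hashlib
import sys

def block_hashes(path, bs=4096):
    """Return the set of md5 digests of each fixed-size block in a file."""
    seen = set()
    with open(path, "rb") as f:
        while chunk := f.read(bs):
            seen.add(hashlib.md5(chunk).digest())
    return seen

# usage: python blockcmp.py base.img clone.img
a = block_hashes(sys.argv[1])
b = block_hashes(sys.argv[2])
total = len(a | b)
print(f"{len(a & b)} of {total} distinct blocks shared "
      f"({100 * len(a & b) / total:.1f}% overlap)")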