Jeff Johnson
2012-May-29 05:08 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Greetings,

I am aiding in the recovery of a multi-petabyte Lustre filesystem (1.8.7) that went down hard due to a site-wide power loss. The power loss left the MDT RAID volume in a critical state. I was able to get the md-RAID-based MDT device mounted read-only, and the MDT mounted read-only as type ldiskfs.

I was able to successfully back up the extended attributes of the MDT; that took about ten minutes. The tar backup of the MDT, however, is taking a very long time: so far it has backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar process, pointers to small or average-size files are backed up quickly and at a consistent pace. When tar encounters a pointer/inode belonging to a very large file (100GB+), the tar process stalls on that file for a very long time, as if it were trying to archive the real file size worth of data rather than just the pointer/inode. During this process there are no errors reported by the kernel, ldiskfs, md, or tar; nothing that would indicate why things are so slow on pointers to large files. While tar is stalled its CPU utilization is at or near 100%, so it is doing something, yet running iostat at the same time shows no reads from the MDT device and no writes to the device where the tarball is being written. It appears that the tar process goes to outer space when it encounters pointers to very large files. Is this expected behavior?

The backup command used is the one from the MDT backup procedure in the 1.8 manual:

    tar zcvf <tarfile> --sparse .

df reports the ldiskfs MDT as 5GB used:

    /dev/md0  2636788616  5192372  2455778504  1%  /mnt/mdt

df -i reports the ldiskfs MDT as having roughly 10.3 million inodes used:

    /dev/md0  1758199808  10353389  1747846419  1%  /mnt/mdt

Any feedback is appreciated!

--Jeff

--
------------------------------
Jeff Johnson
Partner
Aeon Computing
jeff dot johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117
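For reference, the procedure being followed is essentially the device-level MDT backup from the Lustre 1.8 manual: mount the MDT as ldiskfs, save the extended attributes (which hold the striping information), then tar the contents. A minimal sketch, with illustrative device, mount point, and output paths (check the exact getfattr flags against the manual for your release):

    # Mount the MDT read-only as ldiskfs (device and mount point illustrative)
    mount -t ldiskfs -o ro /dev/md0 /mnt/mdt
    cd /mnt/mdt

    # 1) Save the extended attributes; these carry the Lustre striping info
    getfattr -R -d -m '.*' -P . > /backup/ea.bak

    # 2) Archive the MDT contents; --sparse avoids storing the zero-filled
    #    bodies of the sparse inode files in the tarball
    tar zcvf /backup/mdt.tgz --sparse .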
Peter Grandi
2012-May-29 19:28 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
[ ... ]

> The tar backup of the MDT is taking a very long time. So far it has
> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
> process pointers to small or average size files are backed up quickly
> and at a consistent pace. When tar encounters a pointer/inode
> belonging to a very large file (100GB+) the tar process stalls on that
> file for a very long time, as if it were trying to archive the real
> filesize amount of data rather than the pointer/inode.

If you have stripes on, a 100GiB file will have 100,000 1MiB stripes, and each one requires a chunk of metadata. The descriptor for that file will then reference a potentially very large number of extents, scattered around the MDT block device, depending on how slowly the file grew, etc.
Andreas Dilger
2012-May-30 22:02 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
On 2012-05-29, at 1:28 PM, Peter Grandi wrote:

>> The tar backup of the MDT is taking a very long time. So far it has
>> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
>> process pointers to small or average size files are backed up quickly
>> and at a consistent pace. When tar encounters a pointer/inode
>> belonging to a very large file (100GB+) the tar process stalls on that
>> file for a very long time, as if it were trying to archive the real
>> filesize amount of data rather than the pointer/inode.
>
> If you have stripes on, a 100GiB file will have 100,000 1MiB
> stripes, and each requires a chunk of metadata. The descriptor
> for that file will have this potentially a very large number of
> extents, scattered around the MDT block device, depending on how
> slowly the file grew etc.

While that may be true for other distributed filesystems, it is not true for Lustre at all. The size of a Lustre object is not fixed to a "chunk size" like 32MB or similar; rather, it is variable, depending on the size of the file itself. The number of "stripes" (== objects) on a file is currently fixed at file creation time, and the MDS only needs to store the location of each stripe (at most one per OST). The actual blocks/extents of the objects are managed inside the OST itself and are never seen by the client or the MDS.

Cheers, Andreas
--
Andreas Dilger                      Whamcloud, Inc.
Principal Lustre Engineer           http://www.whamcloud.com/
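To make this concrete: the MDT-side layout for a file is just a short list of (OST index, object ID) pairs recorded at creation time, which a client can display with lfs getstripe. A sketch with an illustrative path and made-up object IDs (the exact output format varies somewhat between Lustre versions):

    $ lfs getstripe /lustre/scratch/bigfile.dat
    /lustre/scratch/bigfile.dat
    lmm_stripe_count:   4
    lmm_stripe_size:    1048576
    lmm_stripe_offset:  0
            obdidx           objid           objid            group
                 0         3145728        0x300000                0
                 1         3145730        0x300002                0
                 2         3145733        0x300005                0
                 3         3145729        0x300001                0

Whether the file is 4KB or 100GB, the MDT stores the same four object references; the per-object block allocation lives entirely on the OSTs.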
Alex Kulyavtsev
2012-May-30 22:50 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Is this the same issue as in "backup MDT question" (and follow-up),
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010151.html
due to sparse files on the MDT? Does tar take a lot of CPU?

Alex.

On May 30, 2012, at 5:02 PM, Andreas Dilger wrote:

>>> The tar backup of the MDT is taking a very long time. So far it has
>>> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
>>> process pointers to small or average size files are backed up quickly
>>> and at a consistent pace. When tar encounters a pointer/inode
>>> belonging to a very large file (100GB+) the tar process stalls on that
>>> file for a very long time, as if it were trying to archive the real
>>> filesize amount of data rather than the pointer/inode.
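The linked thread points at a plausible mechanism: if each MDT inode records the file's apparent size but holds (almost) no data blocks, a 100GB Lustre file looks to tar like a 100GB hole, and stock GNU tar's --sparse implementation reads through the entire apparent size hunting for data. Reading holes just returns zeros from memory, which would explain high CPU with no device I/O in iostat. One way to check, run against the ldiskfs-mounted MDT (paths and numbers illustrative):

    # Compare apparent size vs. blocks actually allocated
    $ ls -ls ROOT/path/to/bigfile
    0 -rw-r--r-- 1 user group 107374182400 May 29  2012 ROOT/path/to/bigfile
    # ^ 0 blocks allocated, ~100GB apparent size: entirely sparse

    $ stat -c 'apparent=%s bytes, allocated=%b blocks' ROOT/path/to/bigfile
    apparent=107374182400 bytes, allocated=0 blocks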
Jeff Johnson
2012-May-30 22:57 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Following up on my original post. I switched from the /bin/tar that comes with RHEL/CentOS 5.x to the Whamcloud-patched tar utility. The entire backup was successful and took only 12 hours to complete. CPU utilization was high (>90%), but only on one core. The process was much faster than the standard tar shipped with RHEL/CentOS, and the only slowdowns were on file pointers to very large files (100TB+) with large stripe counts. The files that were going very slowly when I reported the initial problem were backed up instantly with the Whamcloud version of tar.

Best part: the MDT was saved and the 4PB filesystem is in production again.

--Jeff

On 5/30/12 3:02 PM, Andreas Dilger wrote:
> On 2012-05-29, at 1:28 PM, Peter Grandi wrote:
>>> The tar backup of the MDT is taking a very long time. So far it has
>>> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
>>> process pointers to small or average size files are backed up quickly
>>> and at a consistent pace. When tar encounters a pointer/inode
>>> belonging to a very large file (100GB+) the tar process stalls on that
>>> file for a very long time, as if it were trying to archive the real
>>> filesize amount of data rather than the pointer/inode.
>> If you have stripes on, a 100GiB file will have 100,000 1MiB
>> stripes, and each requires a chunk of metadata. The descriptor
>> for that file will have this potentially a very large number of
>> extents, scattered around the MDT block device, depending on how
>> slowly the file grew etc.
> While that may be true for other distributed filesystems, that is
> not true for Lustre at all. The size of a Lustre object is not
> fixed to a "chunk size" like 32MB or similar, but rather is
> variable depending on the size of the file itself. The number of
> "stripes" (== objects) on a file is currently fixed at file
> creation time, and the MDS only needs to store the location of
> each stripe (at most one per OST). The actual blocks/extents of
> the objects are managed inside the OST itself and are never seen
> by the client or the MDS.
>
> Cheers, Andreas

--
------------------------------
Jeff Johnson
Manager
Aeon Computing
jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845   m: 619-204-9061
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117