Jeff Johnson
2012-May-29 05:08 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Greetings,

I am aiding in the recovery of a multi-petabyte Lustre filesystem (1.8.7) that went down hard due to a site-wide power loss. The power loss left the MDT RAID volume in a critical state. I was able to get the md-RAID-based MDT device mounted read-only, and the MDT mounted read-only as type ldiskfs.

I was able to successfully back up the extended attributes of the MDT; that took about ten minutes. The tar backup of the MDT, however, is taking a very long time: so far it has backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar process, pointers to small or average-size files are backed up quickly and at a consistent pace. When tar encounters a pointer/inode belonging to a very large file (100GB+), the tar process stalls on that file for a very long time, as if it were trying to archive the real file size worth of data rather than just the pointer/inode. During this process there are no errors reported by the kernel, ldiskfs, md, or tar; nothing that would indicate why things are so slow on pointers to large files. While tar is stalled its CPU utilization is at or near 100%, so it is doing something, yet running iostat at the same time shows no reads from the MDT device and no writes to the device where the tarball is being written. It appears that the tar process goes to outer space when it encounters pointers to very large files. Is this expected behavior?

The backup command used is the one from the MDT backup procedure in the 1.8 manual:

    tar zcvf <tarfile> --sparse .

df reports the ldiskfs MDT as 5GB used:

    /dev/md0  2636788616  5192372  2455778504  1%  /mnt/mdt

df -i reports the ldiskfs MDT as having roughly 10.3 million inodes used:

    /dev/md0  1758199808  10353389  1747846419  1%  /mnt/mdt

Any feedback is appreciated!

--Jeff

--
------------------------------
Jeff Johnson
Partner
Aeon Computing
jeff dot johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117
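For reference, the procedure being followed is essentially the device-level MDT backup from the Lustre 1.8 manual: mount the MDT as ldiskfs, save the extended attributes (which hold the striping information), then tar the contents. A minimal sketch, with illustrative device, mount point, and output paths (check the exact getfattr flags against the manual for your release):

    # Mount the MDT read-only as ldiskfs (device and mount point illustrative)
    mount -t ldiskfs -o ro /dev/md0 /mnt/mdt
    cd /mnt/mdt

    # 1) Save the extended attributes; these carry the Lustre striping info
    getfattr -R -d -m '.*' -P . > /backup/ea.bak

    # 2) Archive the MDT contents; --sparse avoids storing the zero-filled
    #    bodies of the sparse inode files in the tarball
    tar zcvf /backup/mdt.tgz --sparse .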
Peter Grandi
2012-May-29 19:28 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
[ ... ]

> The tar backup of the MDT is taking a very long time. So far it has
> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
> process pointers to small or average size files are backed up quickly
> and at a consistent pace. When tar encounters a pointer/inode
> belonging to a very large file (100GB+) the tar process stalls on that
> file for a very long time, as if it were trying to archive the real
> filesize amount of data rather than the pointer/inode.

If you have stripes on, a 100GiB file will have 100,000 1MiB stripes, and each one requires a chunk of metadata. The descriptor for that file will then reference a potentially very large number of extents, scattered around the MDT block device, depending on how slowly the file grew, etc.
Andreas Dilger
2012-May-30 22:02 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
On 2012-05-29, at 1:28 PM, Peter Grandi wrote:

>> The tar backup of the MDT is taking a very long time. So far it has
>> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
>> process pointers to small or average size files are backed up quickly
>> and at a consistent pace. When tar encounters a pointer/inode
>> belonging to a very large file (100GB+) the tar process stalls on that
>> file for a very long time, as if it were trying to archive the real
>> filesize amount of data rather than the pointer/inode.
>
> If you have stripes on, a 100GiB file will have 100,000 1MiB
> stripes, and each requires a chunk of metadata. The descriptor
> for that file will have this potentially a very large number of
> extents, scattered around the MDT block device, depending on how
> slowly the file grew etc.

While that may be true for other distributed filesystems, it is not true for Lustre at all. The size of a Lustre object is not fixed to a "chunk size" like 32MB or similar; rather, it is variable, depending on the size of the file itself. The number of "stripes" (== objects) on a file is currently fixed at file creation time, and the MDS only needs to store the location of each stripe (at most one per OST). The actual blocks/extents of the objects are managed inside the OST itself and are never seen by the client or the MDS.

Cheers, Andreas
--
Andreas Dilger                      Whamcloud, Inc.
Principal Lustre Engineer           http://www.whamcloud.com/
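To make this concrete: the MDT-side layout for a file is just a short list of (OST index, object ID) pairs recorded at creation time, which a client can display with lfs getstripe. A sketch with an illustrative path and made-up object IDs (the exact output format varies somewhat between Lustre versions):

    $ lfs getstripe /lustre/scratch/bigfile.dat
    /lustre/scratch/bigfile.dat
    lmm_stripe_count:   4
    lmm_stripe_size:    1048576
    lmm_stripe_offset:  0
            obdidx           objid           objid            group
                 0         3145728        0x300000                0
                 1         3145730        0x300002                0
                 2         3145733        0x300005                0
                 3         3145729        0x300001                0

Whether the file is 4KB or 100GB, the MDT stores the same four object references; the per-object block allocation lives entirely on the OSTs.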
Alex Kulyavtsev
2012-May-30 22:50 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Is this the same issue as in "backup MDT question" (and follow-up),
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010151.html
due to sparse files on the MDT? Does tar take a lot of CPU?

Alex.

On May 30, 2012, at 5:02 PM, Andreas Dilger wrote:

>>> The tar backup of the MDT is taking a very long time. So far it has
>>> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
>>> process pointers to small or average size files are backed up quickly
>>> and at a consistent pace. When tar encounters a pointer/inode
>>> belonging to a very large file (100GB+) the tar process stalls on that
>>> file for a very long time, as if it were trying to archive the real
>>> filesize amount of data rather than the pointer/inode.
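The linked thread points at a plausible mechanism: if each MDT inode records the file's apparent size but holds (almost) no data blocks, a 100GB Lustre file looks to tar like a 100GB hole, and stock GNU tar's --sparse implementation reads through the entire apparent size hunting for data. Reading holes just returns zeros from memory, which would explain high CPU with no device I/O in iostat. One way to check, run against the ldiskfs-mounted MDT (paths and numbers illustrative):

    # Compare apparent size vs. blocks actually allocated
    $ ls -ls ROOT/path/to/bigfile
    0 -rw-r--r-- 1 user group 107374182400 May 29  2012 ROOT/path/to/bigfile
    # ^ 0 blocks allocated, ~100GB apparent size: entirely sparse

    $ stat -c 'apparent=%s bytes, allocated=%b blocks' ROOT/path/to/bigfile
    apparent=107374182400 bytes, allocated=0 blocks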
Jeff Johnson
2012-May-30 22:57 UTC
[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Following up on my original post. I switched from the /bin/tar that comes with RHEL/CentOS 5.x to the Whamcloud-patched tar utility. The entire backup was successful and took only 12 hours to complete. CPU utilization was high (>90%), but only on one core. The process was much faster than the standard tar shipped with RHEL/CentOS, and the only slowdowns were on file pointers to very large files (100TB+) with large stripe counts. The files that were going very slowly when I reported the initial problem were backed up instantly with the Whamcloud version of tar.

Best part: the MDT was saved and the 4PB filesystem is in production again.

--Jeff

On 5/30/12 3:02 PM, Andreas Dilger wrote:
> On 2012-05-29, at 1:28 PM, Peter Grandi wrote:
>>> The tar backup of the MDT is taking a very long time. So far it has
>>> backed up 1.6GB of the 5.0GB used in nine hours. In watching the tar
>>> process pointers to small or average size files are backed up quickly
>>> and at a consistent pace. When tar encounters a pointer/inode
>>> belonging to a very large file (100GB+) the tar process stalls on that
>>> file for a very long time, as if it were trying to archive the real
>>> filesize amount of data rather than the pointer/inode.
>> If you have stripes on, a 100GiB file will have 100,000 1MiB
>> stripes, and each requires a chunk of metadata. The descriptor
>> for that file will have this potentially a very large number of
>> extents, scattered around the MDT block device, depending on how
>> slowly the file grew etc.
> While that may be true for other distributed filesystems, that is
> not true for Lustre at all. The size of a Lustre object is not
> fixed to a "chunk size" like 32MB or similar, but rather is
> variable depending on the size of the file itself. The number of
> "stripes" (== objects) on a file is currently fixed at file
> creation time, and the MDS only needs to store the location of
> each stripe (at most one per OST). The actual blocks/extents of
> the objects are managed inside the OST itself and are never seen
> by the client or the MDS.
>
> Cheers, Andreas

--
------------------------------
Jeff Johnson
Manager
Aeon Computing
jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845   m: 619-204-9061
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117