Hi all,

I'm copying data between two MDTs in a test system. Having mounted the partition as 'ldiskfs', I had a look in MDT/ROOT. I found all my test data there, but I'm puzzled by the indicated file sizes. For example, I had put one of my holiday movies there, which is 40MB. On the ldiskfs-mounted MDT, I find a corresponding entry, which also has 40MB according to 'ls -lh'. Of course, the latter file doesn't have the contents of that movie, but why is it the same size? 'ls -li' also gives identical results, btw.

On the other hand, there is another movie which is 6.4MB as such, but 0B on the MDT partition.

Both movies play nicely, so there is no problem with this file system. On our production system, the MDT takes 13GB for 250TB of data, so obviously there aren't entries on the MDT taking the size of the real data files ;-)

So my question is whether the file size reported by 'ls' on the MDT has any practical implication?

-- 
Thomas Roth
IT-Department
Location: SB3 1.262
Phone: +49-6159-71 1453

'We apologise for the inconvenience.'
On Jul 27, 2009 14:24 +0200, Thomas Roth wrote:
> I'm copying around data between 2 MDTs in a test system. Having mounted
> the partition as 'ldiskfs', I had a look in MDT/ROOT. I found all my
> test data there, but I'm puzzled by the indicated file sizes. For
> example I had put one of my holiday's movies, it's 40MB. On the
> ldiskfs-mounted MDT, I find a corresponding entry, which also has 40MB,
> as given by 'ls -lh'. Of course, the latter file doesn't have the
> contents of that movie, but why is it the same size? 'ls -li' also gives
> identical results, btw.
> On the other hand, there is another movie which is 6.4MB as such, but
> 0B on the MDT partition.

In Lustre 1.6.7 the "approximate" file size started to be stored on the MDT inodes in order to facilitate[*] filesystem backup utilities to allow them to have a fast estimate of the file size w/o having to access the OST objects (that hold the authoritative size). This size cannot be used as the official file size in 1.x because there isn't sufficient locking and recovery of the size in case of a crash, though a preview of this feature (Size On MDS, SOM) will be available in the 2.0 release.

> Both movies play nicely, so there is no problem with this file system.
> On our production system, the MDT takes 13GB for 250TB of data,
> obviously there aren't entries on the MDT taking the size of the real
> data files ;-)
>
> So my question is whether the file size reported by 'ls' on the MDT has
> any practical implication?

This size is not actively updated for pre-existing files, nor is it always guaranteed to be written in case of a crash, which is why you see some (likely older) files that do not have the size information.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
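To make the distinction in this reply concrete: the size that 'ls -lh' prints is just a number recorded in the inode, and it can differ from the space actually allocated on disk. The following is a minimal Python sketch of that difference, not Lustre code. It assumes a Linux-like system where `st_blocks` is counted in 512-byte units, and the helper names and the file name `movie.avi` are made up for illustration. A file created with a length but no data stands in for an MDT entry that carries only the approximate size.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, bytes actually allocated) for path."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


def make_size_only_file(path, size):
    """Create a file that has a length but no data written to it."""
    with open(path, "wb") as f:
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "movie.avi")  # hypothetical file name
        make_size_only_file(path, 40 * 1024 * 1024)
        apparent, allocated = apparent_and_allocated(path)
        print(f"size as shown by ls: {apparent} bytes")
        print(f"allocated on disk:   {allocated} bytes")
```

On a filesystem that supports sparse files, the second number should be far smaller than the first, which is consistent with an MDT of 13GB listing entries whose sizes add up to much more.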
Hi Andreas,

Andreas Dilger wrote:
> On Jul 27, 2009 14:24 +0200, Thomas Roth wrote:
>> I'm copying around data between 2 MDTs in a test system. Having mounted
>> the partition as 'ldiskfs', I had a look in MDT/ROOT. I found all my
>> test data there, but I'm puzzled by the indicated file sizes. For
>> example I had put one of my holiday's movies, it's 40MB. On the
>> ldiskfs-mounted MDT, I find a corresponding entry, which also has 40MB,
>> as given by 'ls -lh'. Of course, the latter file doesn't have the
>> contents of that movie, but why is it the same size? 'ls -li' also gives
>> identical results, btw.
>> On the other hand, there is another movie which is 6.4MB as such, but
>> 0B on the MDT partition.
>
> In Lustre 1.6.7 the "approximate" file size started to be stored on the
> MDT inodes in order to facilitate[*] filesystem backup utilities to
> allow them to have a fast estimate of the file size w/o having to access
> the OST objects (that hold the authoritative size). This size cannot
> be used as the official file size in 1.x because there isn't sufficient
> locking and recovery of the size in case of a crash, though a preview of
> this feature (Size On MDS, SOM) will be available in the 2.0 release.

I get the impression that this feature hampers the device-level backup (or is it file-level backup?): extracting the extended attributes and making a tar archive of the MDT. The latter step now takes 5 days on our production system (which is 1.6.7.1). Right now I'm trying to do an rsync copy of that MDT. When that seemed to be stuck on a particular file, I checked the file on the source, albeit primitively with "ls -lh". That told me that the file was 9.1GB, and rsync behaves just as you would expect when it has to transfer 9GB over the network: it takes some time. In fact, there are several of these files, and as I mentioned, the MDT takes only 13GB on disk, so all of this is a bit confusing.

The first attempts to copy the MDT immediately resulted in a target file system blown up out of all proportion. I have since added the option "--sparse" to my rsync command line. Now the target system seems to stay small, but I have yet to see whether the result can be used as an MDT at all.

Of course all this may just be due to our MDT being damaged somehow ...

>> Both movies play nicely, so there is no problem with this file system.
>> On our production system, the MDT takes 13GB for 250TB of data,
>> obviously there aren't entries on the MDT taking the size of the real
>> data files ;-)
>>
>> So my question is whether the file size reported by 'ls' on the MDT has
>> any practical implication?
>
> This size is not actively updated for pre-existing files, nor is it
> always guaranteed to be written in case of a crash, which is why you
> see some (likely older) files that do not have the size information.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>

Regards,
Thomas
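The effect described in this message, a target that blows up without "--sparse" and stays small with it, can be illustrated with a small Python sketch. This is not rsync's implementation, only an assumed simplification of the general idea: the copier still reads every byte of the source, but seeks over all-zero chunks on the destination instead of writing them. The function name `sparse_copy` and the chunk size are made up for the example.

```python
import os
import tempfile

CHUNK = 64 * 1024  # arbitrary chunk size chosen for this sketch


def sparse_copy(src, dst, chunk=CHUNK):
    """Copy src to dst, leaving holes where src contains only zeros."""
    zeros = bytes(chunk)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            if buf == zeros[:len(buf)]:
                # All zeros: skip ahead so the filesystem can leave a hole.
                fout.seek(len(buf), os.SEEK_CUR)
            else:
                fout.write(buf)
        # Set the final length explicitly in case the file ends in a hole.
        fout.truncate(fin.tell())
    return os.stat(dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.truncate(16 * 1024 * 1024)  # 16 MiB apparent size
            f.seek(8 * 1024 * 1024)
            f.write(b"some real data")
        st = sparse_copy(src, dst)
        print(f"apparent: {st.st_size} bytes, allocated: {st.st_blocks * 512} bytes")
```

Note that the whole source is still read, which matches the observation that the copy of a 9.1GB entry takes a long time even when the destination ends up small.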
On Jul 28, 2009 11:22 +0200, Thomas Roth wrote:
> Andreas Dilger wrote:
> > On Jul 27, 2009 14:24 +0200, Thomas Roth wrote:
> >> I'm copying around data between 2 MDTs in a test system. Having mounted
> >> the partition as 'ldiskfs', I had a look in MDT/ROOT. I found all my
> >> test data there, but I'm puzzled by the indicated file sizes. For
> >> example I had put one of my holiday's movies, it's 40MB. On the
> >> ldiskfs-mounted MDT, I find a corresponding entry, which also has 40MB,
> >> as given by 'ls -lh'. Of course, the latter file doesn't have the
> >> contents of that movie, but why is it the same size? 'ls -li' also gives
> >> identical results, btw.
> >> On the other hand, there is another movie which is 6.4MB as such, but
> >> 0B on the MDT partition.
> >
> > In Lustre 1.6.7 the "approximate" file size started to be stored on the
> > MDT inodes in order to facilitate[*] filesystem backup utilities to
> > allow them to have a fast estimate of the file size w/o having to access
> > the OST objects (that hold the authoritative size). This size cannot
> > be used as the official file size in 1.x because there isn't sufficient
> > locking and recovery of the size in case of a crash, though a preview of
> > this feature (Size On MDS, SOM) will be available in the 2.0 release.
>
> I get the impression that this feature hampers the device level backup -
> or is it file level backup: Extracting extended attributes and make a
> tar archive of the MDT: the latter step now takes 5 days on our
> production system (which is 1.6.7.1). And right now I'm trying to do a
> rsync copy of that MDT. When that seemed to be stuck on a particular
> file, I checked the file, on the source, albeit primitively with
> "ls -lh". That told me that the file was 9.1GB, and the rsync behaves
> just as you would expect when it has to transfer 9GB over the network -
> takes some time. In fact, there are several of these files, and as I
> mentioned, the MDT takes only 13GB on disk, so all of this is a bit
> confusing.

You are right. In some use cases this feature has hampered backup. It is possible to use a block-device level backup (e.g. dd or dump) without problems. I use "dd" locally to do MDS backups so I didn't notice this issue during testing.

> The first attempts to copy the MDT resulted immediately in a target file
> system blown up beyond proportions. I have since added the option
> "--sparse" to my rsync command line. Now the target system seems to stay
> small, but I have yet to see if the result could be used as an MDT at all.

It would also be possible to modify tar and rsync to use the "FIEMAP" support available in newer versions of the kernel (2.6.27 at least), so that they don't have to read all of the data from the file. This would result in much faster backups for any kind of sparse files, but as yet that work hasn't been done.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
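The idea in the last paragraph, letting a backup tool ask the filesystem where the data is instead of reading through every hole, can be sketched in Python. Note the substitution: FIEMAP is an ioctl, and this sketch does not use it. It uses `lseek` with `SEEK_DATA` and `SEEK_HOLE` instead, a different and later kernel interface that serves a similar purpose. It assumes a platform where Python exposes `os.SEEK_DATA` and `os.SEEK_HOLE` (for example Linux); the function name `data_extents` is made up for illustration.

```python
import errno
import os
import tempfile


def data_extents(path):
    """Return a list of (offset, length) ranges that the filesystem reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        pos = 0
        while pos < end:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole from pos to end of file
                raise
            # The end of a file always counts as a hole, so this terminates.
            hole = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, hole - start))
            pos = hole
    finally:
        os.close(fd)
    return extents


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse.bin")
        with open(path, "wb") as f:
            f.truncate(8 * 1024 * 1024)
            f.seek(4 * 1024 * 1024)
            f.write(b"x" * 4096)
        for offset, length in data_extents(path):
            print(f"data at offset {offset}, {length} bytes")
```

A copy tool built on this would only need to read the reported ranges. On filesystems without hole reporting, the kernel may report the whole file as a single data range, in which case nothing is gained but the result is still correct.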