So I guess there are some things I _still_ don't understand about Lustre metadata handling. Specifically, what metadata gets stored on OSTs and why.

What brings this all up is that a) we have users who have lots of files and b) we are currently going through a reorganization that requires changing the groups on lots of these files (this is all running Lustre 1.8.4; we're due for an upgrade in the medium future).

I figured okay, this wouldn't be so bad, since those are all metadata server operations. But I started running some tests, and I found out that chown() system calls perform poorly.

Because I was doing some previous metadata performance analysis, I took a source code tree which consists of approximately 50,000 files and put two copies in one of our Lustre filesystems: one with the default striping (across all OSTs) and one where all files have no striping at all. The performance difference between these two trees for stat() calls is large, as you can imagine, but the disparity for chown() calls is even larger. You can run chgrp on all of the files in the non-striped copy in about 3-5 seconds, but the striped copy takes more than 50 seconds.

I did some more digging as to why this is. I thought at first that this might be an issue on the client, but there is code in there that skips talking to the OSTs for certain types of metadata updates, and turning on debugging on the client verifies that no setattr RPCs are being sent to the OSSes. Looking more closely at the RPC traces reveals that the issue is on the metadata server; the setattr RPCs simply take longer when the files are striped.

I've looked at the metadata server code for a bit, and I've verified that the metadata server does send setattr RPCs to the OSSes, but I see that it's done asynchronously; it shouldn't be waiting for the replies. So I'm stumped as to why this is happening. I also realize that I'm still puzzled as to what metadata is stored on the OSTs; it seems like the client prefers the metadata from the MDS (except of course for size), but a fair amount of metadata is still stored on the OSSes. Can anyone shed some light on this?

--Ken
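For anyone who wants to repeat the comparison described above, here is a minimal sketch of the timing test. It is not the script Ken used (the message only mentions running chgrp); the function name `time_chgrp` is made up for illustration, and it simply walks a tree and changes the group of every entry, leaving the owner alone.

```python
import os
import time


def time_chgrp(root, gid):
    """Change the group of every entry below root; return (count, seconds).

    The root directory itself is not touched. Symlinks are changed
    rather than followed, similar to `chgrp -h`.
    """
    count = 0
    start = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # uid=-1 leaves the owner untouched, so only the group changes.
            os.chown(path, -1, gid, follow_symlinks=False)
            count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Running it once on the striped copy and once on the non-striped copy (for example `time_chgrp("/mnt/lustre/tree-striped", 1234)`, where both the path and the gid are placeholders) should give numbers comparable to the 3-5 second and 50+ second figures above.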
In reply to Ken Hornstein <kenh at cmf.nrl.navy.mil>, 2011-05-20, 7:49 AM:

Ken, the OSTs need to track the ownership of objects for quota. The more stripes there are on a file, the more RPCs that need to be sent, which is why we don't recommend wide striping unless there is a reason for it (bandwidth, size, etc).

Cheers, Andreas

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
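To put rough numbers on the point about stripes and RPCs: if each file needs one setattr per OST object, the number of RPCs the MDS has to send grows linearly with the stripe count. The sketch below assumes exactly that (one RPC per stripe object, no batching). The stripe counts 4 and 16 are hypothetical, since the thread does not say how many OSTs the filesystem has.

```python
def ost_setattr_rpcs(num_files, stripe_count):
    """Setattr RPCs sent to OSTs, assuming one RPC per stripe object per file."""
    return num_files * stripe_count


files = 50_000  # size of the test tree from the original message
for stripes in (1, 4, 16):  # 1 = non-striped copy; 4 and 16 are made-up examples
    print(stripes, ost_setattr_rpcs(files, stripes))
```

Under these assumptions, a chgrp over the non-striped tree generates 50,000 OST RPCs, while a 16-way striped tree would generate 800,000.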
> Ken, the OSTs need to track the ownership of objects for quota. The more
> stripes there are on a file, the more RPCs that need to be sent, which is why
> we don't recommend wide striping unless there is a reason for it (bandwidth,
> size, etc).

Fair enough; I always forget about quota accounting, because we never use it. But I'm wondering why this in particular causes such a hit, because the MDS sends the setattr RPCs asynchronously; in theory it should just fire them off and not have to wait until they're done. Perhaps it's the overhead of sending those RPCs which is slowing things down? I could believe that, although I would have thought that it wouldn't be that bad.

--Ken
In reply to Ken Hornstein <kenh at cmf.nrl.navy.mil>, 2011-05-20, 10:47 AM:

It would be interesting to find out what is causing the bottleneck. At one time there was no throttle on the number of RPCs that the MDS could send, which caused overload problems on the OSTs. Now, the MDS is limited by the normal rpcs_in_flight tunable (=8) that clients are limited to.

It would be worthwhile to see if increasing this helps the overall performance. If yes, then it would make sense to tune the OSCs on the MDS for more RPCs by default.

Cheers, Andreas
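A back-of-the-envelope way to check whether the in-flight limit could plausibly explain the numbers in this thread: if the MDS really is throttled to 8 outstanding RPCs per OST, and every file has an object on every OST, then each OST sees one setattr per file and the run cannot finish faster than files × latency / 8. This is only a sketch under those assumptions. The thread has not established that this limit is the actual bottleneck, and the function names are invented for illustration.

```python
def min_seconds(num_files, rpc_latency_s, rpcs_in_flight=8):
    """Lower bound on run time if each OST handles one setattr per file
    and at most rpcs_in_flight of them are outstanding at once."""
    return num_files * rpc_latency_s / rpcs_in_flight


def implied_latency_s(num_files, elapsed_s, rpcs_in_flight=8):
    """Per-RPC latency that would make the in-flight limit account
    for the whole observed run time."""
    return elapsed_s * rpcs_in_flight / num_files


# Figures from the thread: 50,000 files, roughly 50 seconds, limit of 8.
print(implied_latency_s(50_000, 50.0))
# The same latency with a hypothetical limit of 32 instead of 8.
print(min_seconds(50_000, implied_latency_s(50_000, 50.0), rpcs_in_flight=32))
```

If the measured setattr latency on the OSTs is in the neighbourhood of the first figure, raising the limit is worth trying; if it is far lower, the time is probably going somewhere else, such as the per-RPC send overhead Ken suspects.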