Hello Folks,

So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, a default offset of 0 results in all writes beginning at OSS0-OST0. With a default stripe count of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?

With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.

That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") would be for this theoretical situation (given my understanding of striping under Lustre).

In reality, we have quite a varied workload on our file systems, with codes ranging from bio to astrophysics and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known prior to enabling "random" offsets? Any info would be greatly appreciated.
----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
John White wrote:
> So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, a default offset of 0 results in all writes beginning at OSS0-OST0. With a default stripe count of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?
>
> With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.
>
> That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") would be for this theoretical situation (given my understanding of striping under Lustre).
>
> In reality, we have quite a varied workload on our file systems, with codes ranging from bio to astrophysics and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known prior to enabling "random" offsets? Any info would be greatly appreciated.
> ----------------
> John White

Hi John,

AFAIK, -1 is the default, so objects are allocated randomly, avoiding the hot-spot situation you describe. Setting the offset to anything other than random is probably a bad idea.

Setting the stripe count is a more complicated question. We set the filesystem not to stripe by default, but then set striping on datasets that we know are going to be hot. Many bio apps like to read a common data set in parallel, and striping really helps performance there.

The only problem we have historically seen has been OSTs becoming unbalanced over time. If an OST fills up, users get sporadic "filesystem full" errors even though df shows free space. We found that this was typically due to code being left in debug mode and writing out multi-TByte log files with no striping, which led to single OSTs filling up. Quotas are your friend here. Lustre 1.6 and later will switch to a weighted allocator if the OSTs start to become unbalanced (see section 24.4.4 in the 1.6 manual). Making the OSTs as large as possible helps too, so that the average file size << OST size.

Ultimately you need to spend time educating your users about the pros and cons of striping, especially if you have a very mixed application set. Having good stats is important too; we have historical load graphs for the OSSs (ganglia) and for our OSTs (rrdtool data collection straight from the disk controllers). That helps us to identify times when users have got things wrong and a single OST/OSS is being hammered due to incorrect striping.

Cheers,

Guy

--
Dr Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 ex 6925
Fax: +44 (0)1223 496802

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
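For reference, the pattern Guy describes (no striping as the filesystem default, wide striping only on known-hot directories) is applied per directory with lfs setstripe; new files then inherit the directory's layout. This is only a rough sketch: the mount point and directory names are made up, and the flag spellings follow the 1.8 lfs man page, so check them against your installed version. The start OST index is deliberately left alone so the MDS keeps balancing placement.

    # Filesystem-wide default, set on the root: one stripe per file
    lfs setstripe -c 1 /mnt/lustre

    # Hot, widely-read reference dataset: stripe across all OSTs (-c -1)
    lfs setstripe -c -1 /mnt/lustre/shared/refdata

    # Verify what new files created under that directory will inherit
    lfs getstripe /mnt/lustre/shared/refdata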
On 2009-11-24, at 12:17, John White wrote:
> So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, the default offset set to 0 results in all writes beginning at OSS0-OST0. With a default stripe of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?

As previously mentioned, the default is NOT to always start files with OST0, but rather to use "round-robin with precession" (not random, as is commonly mentioned), so that the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count.

> With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.
>
> That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") is to solving this theoretical situation (given my understanding of striping under Lustre).

Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:

    A start-ost of -1 allows the MDS to choose the starting
    index and it is strongly recommended, as this allows
    space and load balancing to be done by the MDS as needed.

> In reality, we have a quite varied workload on our file systems with codes ranging from bio to astrophys and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known previous to enabling "random" offsets? Any info would be greatly appreciated.

It isn't random, specifically to avoid the case of non-uniform distribution when many clients are creating files at one time. With random stripe-0 OST selection, it is inevitable that some OSTs get one or two more objects and some OSTs get one or two fewer objects, and this can cause dramatic performance impacts.

For example, if the average is 2 objects per OST, but some OSTs get 4 objects and others get no objects, then the application may see an aggregate performance drop of 50% or more with random object distribution. With round-robin distribution, every OST will get 2 objects (assuming objects / OSTs is a whole number).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
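The round-robin behaviour Andreas describes is easy to observe directly. A quick, purely illustrative check (/mnt/lustre is a placeholder mount point):

    # Create a few single-stripe files and look at which OST received stripe 0 of each
    for i in 0 1 2 3 4 5 6 7; do touch /mnt/lustre/rr-test.$i; done
    lfs getstripe /mnt/lustre/rr-test.*

    # The "obdidx" of the first (only) stripe should walk across the OSTs
    # rather than repeatedly landing on OST0. Remove the test files afterwards:
    rm /mnt/lustre/rr-test.*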
We are running Lustre 1.6, amounting to 12TB of space. We use a stripe offset and stripe count of '-1' and a stripe size of 2MB. Data on the filesystem comprises very small to very large files. Some days back we observed write failures on the fs in spite of having 1.2TB of space available (as given by df). The problem was that 2 of the OSTs were 100% full. So can we conclude that, more often than not, those 2 OSTs were chosen as the start offset for files that were smaller in size (<= 2MB)?

On Wed, Nov 25, 2009 at 2:59 AM, Andreas Dilger <adilger at sun.com> wrote:
> On 2009-11-24, at 12:17, John White wrote:
> > So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, the default offset set to 0 results in all writes beginning at OSS0-OST0. With a default stripe of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?
>
> As previously mentioned, the default is NOT to always start files with OST0, but rather to use "round-robin with precession" (not random, as is commonly mentioned), so that the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count.
>
> > With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.
> >
> > That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") is to solving this theoretical situation (given my understanding of striping under Lustre).
>
> Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:
>
>     A start-ost of -1 allows the MDS to choose the starting
>     index and it is strongly recommended, as this allows
>     space and load balancing to be done by the MDS as needed.
>
> > In reality, we have a quite varied workload on our file systems with codes ranging from bio to astrophys and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known previous to enabling "random" offsets? Any info would be greatly appreciated.
>
> It isn't random, specifically to avoid the case of non-uniform distribution when many clients are creating files at one time. With random stripe-0 OST selection, it is inevitable that some OSTs get one or two more objects and some OSTs get one or two fewer objects, and this can cause dramatic performance impacts.
>
> For example, if the average is 2 objects per OST, but some OSTs get 4 objects and others get no objects, then the application may see an aggregate performance drop of 50% or more with random object distribution. With round-robin distribution, every OST will get 2 objects (assuming objects / OSTs is a whole number).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
--
Regards,
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune, Maharastra
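Before concluding that the allocator favoured those two OSTs, it is worth looking at what is actually occupying them; a handful of very large or unstriped files is the usual culprit. A sketch of the commands involved (the mount point and OST UUID below are placeholders, and the exact spelling of the lfs find option should be checked against the "handling full OSTs" section of the manual for your release):

    # Per-OST space usage, rather than the aggregate that plain df reports
    lfs df -h /mnt/lustre

    # List files that have objects on a particular full OST, e.g. OST index 7
    lfs find --obd lustre-OST0007_UUID /mnt/lustre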
On 2009-11-25, at 04:12, rishi pathak wrote:
> We are running Lustre 1.6, amounting to 12TB of space. We use a stripe offset and stripe count of '-1' and a stripe size of 2MB.

Using stripe_count = -1 means "always stripe over all OSTs".

> Data on the filesystem comprises very small to very large files. Some days back we observed write failures on the fs in spite of having 1.2TB of space available (as given by df). The problem was that 2 of the OSTs were 100% full.

That is because you are requesting all files to be stored on all OSTs. If the OSTs are not the same size this can cause such problems. Also, if the OSTs are "small" then it is more likely that one or another will be filled before the others. As a general rule, it is better to have OSTs as large as possible, rather than having more small OSTs.

> So can we conclude that, more often than not, those 2 OSTs were chosen as the start offset for files that were smaller in size (<= 2MB)?

At the time the files are created, Lustre cannot know what size they will be. When there is unbalanced space usage like this, it usually means that a small number of very large files are causing the OSTs to be filled.

There is also ongoing work to improve the allocator so that it will do a better job of continually balancing space usage across all OSTs, rather than only starting to rebalance when the space usage is very different between OSTs.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
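Following up on the stripe_count = -1 point: with a workload that is mostly small files, a narrower filesystem default (and explicit wide striping only where it is needed) keeps small files off most OSTs and makes it much harder for any single OST to fill. A hedged sketch only; the paths are placeholders and the flag spellings follow the 1.8 lfs man page:

    # Change the filesystem default: one stripe per file, 2MB stripe size,
    # leaving the start OST at its default of -1 so the MDS balances placement
    lfs setstripe -c 1 -s 2M /mnt/lustre

    # Directory where the genuinely large outputs go: stripe over all OSTs
    lfs setstripe -c -1 -s 2M /mnt/lustre/large-output

    # Note: existing files keep their old layout; only newly created files
    # pick up the new defaults (restriping a file requires copying it).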
[ ... ]

jwhite> In reality, we have a quite varied workload on our file
jwhite> systems with codes ranging from bio to astrophys and, as
jwhite> such, writes ranging from very small to very large.

Good luck with that -- there is no file system design that can cope *well* with all of that. Of course every vendor will tell you otherwise about their product :-). Also, there is no filesystem (as in: file system instance, actual storage pool) that can cope well with all of that either.

jwhite> Any real-world experience with these situations? Are
jwhite> there strange inefficiencies or administrative
jwhite> difficulties that should be known [ ... ]

Well, following on from the previous point, my recommendation would be to consider using different tools for different purposes, or at the very least different (storage) pools for different purposes.

Lustre is a good candidate for the "silver bullet" sell because it is indeed pretty decent all-round, and there isn't much else (perhaps GlusterFS) in the same class, as NFS and SMB and OpenAFS all have greater limitations in some respects (especially on Linux). It also has an underappreciated aspect: it is pretty decent even as a non-cluster filesystem (e.g. one MDT and some OSTs on a single server), because it is based on something close to 'ext4', which is ok-ish, and because its network protocol is fairly good, with arguably a better design than the NFS or SMB protocols and, quite importantly, a better client implementation than the Linux NFS or SMB clients.

But even if using Lustre for many sorts of things, I would not use a single giant storage pool anyhow. Sure, "management" consultants push the sweet drug of consolidation and virtualization, but taken to an extreme it creates a lot of problems, as a single immense storage pool becomes rigid and hard to manage. However, in many cases "management" can easily be sold on the "silver bullet" theory that a single tool and pool can support every requirement and the so-called "reality principle" need not apply.

If your case is not like that, one possible change would be to use Lustre for a lot of different applications, but with several different storage pools, each with rather different storage backends and tuning.
Andreas Dilger wrote:
> Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:
>
>     A start-ost of -1 allows the MDS to choose the starting
>     index and it is strongly recommended, as this allows
>     space and load balancing to be done by the MDS as needed.

The Lustre Manual should be updated to use that wording. It still says "random":

http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_78664

Also, it lists the default stripe-count as 1 when it should be 2.

Also, he might want to be aware that round-robin is only used if no two OSTs are imbalanced by more than 20%. Otherwise, the weighted allocator kicks in:

http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_pgfId-1293986

We haven't had time to look into it very closely yet, but we have been getting complaints from users that seem to be a result of the weighted allocator. It appears not to be uncommon for OSTs to get more than 20% out of balance on our systems, so the weighted allocator is in use fairly frequently.

The users are complaining of reduced filesystem bandwidth, and we suspect the weighted allocator. It results in the users' files being quite unevenly distributed among the OSTs. Obviously, this is done purposely, with files more likely to be created on OSTs that have more free space. But it also results in an unbalanced distribution of files, and therefore poor bandwidth.

We would probably prefer a simpler algorithm: possibly just stop creating new files on any OST that is 20% more full, and round-robin over the remaining OSTs.

Like I said, we haven't had time to look into it too closely, so we don't have a bug open yet. But it is something to keep in mind.

Chris
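For anyone wanting to check whether their MDS has crossed into weighted-allocator territory, the manual section Chris links documents a couple of tunables on the MDS's lov device. The parameter names below are from memory of the 1.8 documentation, so verify them on your own system before relying on this (lctl get_param/set_param need 1.6.5 or later, and these should be run on the MDS):

    # Imbalance threshold (percent) at which round-robin gives way to the weighted allocator
    lctl get_param lov.*.qos_threshold_rr

    # Weighting of free space versus round-robin once the weighted allocator is active
    lctl get_param lov.*.qos_prio_free

    # Raising the threshold keeps plain round-robin in use longer, e.g.:
    lctl set_param lov.*.qos_threshold_rr=40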
On 2009-12-29, at 13:51, Christopher J. Morrone wrote:
> Andreas Dilger wrote:
>> Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:
>>
>>     A start-ost of -1 allows the MDS to choose the starting
>>     index and it is strongly recommended, as this allows
>>     space and load balancing to be done by the MDS as needed.
>
> The Lustre Manual should be updated to use that wording. It still says "random":
>
> http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_78664
>
> Also, it lists the default stripe-count as 1 when it should be 2.

AFAIK, the default stripe count is still 1. Is it possible you've changed this default locally?

> Also, he might want to be aware that round-robin is only used if no two OSTs are imbalanced by more than 20%. Otherwise, the weighted allocator kicks in:
>
> http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_pgfId-1293986

Right.

> We haven't had time to look into it very closely yet, but we have been getting complaints from users that seem to be a result of the weighted allocator. It appears not to be uncommon for OSTs to get more than 20% out of balance on our systems, so the weighted allocator is in use fairly frequently.
>
> The users are complaining of reduced filesystem bandwidth, and we suspect the weighted allocator. It results in the users' files being quite unevenly distributed among the OSTs. Obviously, this is done purposely, with files more likely to be created on OSTs that have more free space. But it also results in an unbalanced distribution of files, and therefore poor bandwidth.
>
> We would probably prefer a simpler algorithm: possibly just stop creating new files on any OST that is 20% more full, and round-robin over the remaining OSTs.
>
> Like I said, we haven't had time to look into it too closely, so we don't have a bug open yet. But it is something to keep in mind.

There is bug 18547 open to track the development of an improved QOS-RR allocator. The goal is to always use round-robin allocation, but selectively skip OSTs that are too full. There would no longer be a "QOS mode" per se; it would always be active, hopefully avoiding imbalances gently as soon as they appear, rather than letting them get too far out of balance.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.