Hi,

We are testing 1.6 beta7 over IB. On a test setup (1 MDS, 1 OSS with 6 OSTs, each a 5-drive RAID 5) we observe uneven load among the OSTs. Testing with one to 5 clients, Lustre schedules evenly (one OST per client). With a client count > 5, sometimes one OST is not used at all (e.g. 6, 9 clients), or the utilisation is otherwise not as expected. The FS is otherwise empty. We used IOR for testing. The following examples illustrate the issue; the 2% baseline is the overhead of the empty FS.

OST usage for 9 clients (expected: 3 OSTs with 2 clients, 3 OSTs with 1 client each):

16% /mnt/test/ost0 - 2 clients
 2% /mnt/test/ost1 - 0 clients ***
 9% /mnt/test/ost2 - 1 client
16% /mnt/test/ost3 - 2 clients
16% /mnt/test/ost4 - 2 clients
16% /mnt/test/ost5 - 2 clients

OST usage for 12 clients (expected: 6 OSTs with 2 clients each):

 9% /mnt/test/ost0 - 1 client
 9% /mnt/test/ost1 - 1 client
23% /mnt/test/ost2 - 3 clients
23% /mnt/test/ost3 - 3 clients
16% /mnt/test/ost4 - 2 clients
16% /mnt/test/ost5 - 2 clients

OST usage for 15 clients (expected: 4 OSTs with 3 clients, 2 OSTs with 2 clients each):

16% /mnt/test/ost0 - 2 clients
23% /mnt/test/ost1 - 3 clients
23% /mnt/test/ost2 - 3 clients
23% /mnt/test/ost3 - 3 clients
23% /mnt/test/ost4 - 3 clients
 9% /mnt/test/ost5 - 1 client ***

OST usage for 18 clients (expected: 6 OSTs with 3 clients each):

23% /mnt/test/ost0 - 3 clients
29% /mnt/test/ost1 - 4 clients
29% /mnt/test/ost2 - 4 clients
23% /mnt/test/ost3 - 3 clients
16% /mnt/test/ost4 - 2 clients
16% /mnt/test/ost5 - 2 clients

The behavior is reproducible. Uneven OST utilisation leads to lower than possible performance. How can we achieve a better distribution over the OSTs without manual assignment? Is there a setting to force round-robin scheduling of the OSTs?

---
Stripe setting:
We want very high performance to a single client by striping over all 6 OSTs. What parameters should be adjusted to achieve optimal performance?

Thanks,
Mirko
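[For reference, an IOR run of the shape described above (file-per-process, one 5 GB file per client over POSIX) would look something like the following sketch; the node list, transfer size, and output path are illustrative, not the exact parameters from the report:

    # one IOR task per client node, each writing and then re-reading
    # its own 5 GB file under the Lustre mount (paths illustrative)
    mpirun -np 9 -machinefile clients.txt ./IOR -a POSIX -w -r -F \
        -b 5g -t 1m -o /mnt/test/iorfile
]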
Mirko Benz wrote:
> Hi,
>
> We are testing 1.6 beta7 over IB. On a test setup (1 MDS, 1 OSS with 6
> OSTs, each a 5-drive RAID 5) we observe uneven load among the OSTs.
> Testing with one to 5 clients, Lustre schedules evenly (one OST per
> client). With a client count > 5, sometimes one OST is not used at all
> (e.g. 6, 9 clients), or the utilisation is otherwise not as expected.
> The FS is otherwise empty. We used IOR for testing.

There is never a 1:1 mapping between clients and OSTs. A round-robin
algorithm is used for OST stripe selection until the OST free space
differs by more than 20%. However, depending on how big the files
actually are, some stripes may be mostly empty and some full. For a more
complete explanation of stripe assignments, see
http://arch.lustre.org/index.php?title=Feature_Free_Space_Management

> The behavior is reproducible. Uneven OST utilisation leads to lower
> than possible performance. How can we achieve a better distribution
> over the OSTs without manual assignment?
> Is there a setting to force round-robin scheduling of the OSTs?

As explained above.

> ---
> Stripe setting:
> We want very high performance to a single client by striping over all
> 6 OSTs.
> What parameters should be adjusted to achieve optimal performance?

Set the default stripe count to 6:

    lctl> conf_param <fsname>-MDT0000.lov.stripecount=6
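[Alongside the filesystem-wide default above, striping can also be set per directory so that new files created beneath it inherit the layout. A minimal sketch, assuming the positional setstripe arguments (stripe size, starting OST index, stripe count) of the 1.6-era lfs tool; the directory name is illustrative:

    # new files under /mnt/test/wide will stripe across all 6 OSTs;
    # 0 = default stripe size, -1 = let the MDS choose the starting OST
    mkdir /mnt/test/wide
    lfs setstripe /mnt/test/wide 0 -1 6

    # verify the layout that new files will inherit
    lfs getstripe /mnt/test/wide
]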
Hi Nathaniel,

When using 6 clients (IOR test) with 6 OSTs on a single OSS -- why does
Lustre use only 5 OSTs (4 OSTs with 1 client each, 1 OST with two
clients, 1 empty OST)? The FS is empty -- round robin should be used.
The OST size is 100 GB and the file size is 5 GB -- so the free space
cannot differ by more than 20%.

Regards,
Mirko

Nathaniel Rutman schrieb:
> Mirko Benz wrote:
>> [...]
> There is never a 1:1 mapping between clients and OSTs. A round-robin
> algorithm is used for OST stripe selection until the OST free space
> differs by more than 20%. However, depending on how big the files
> actually are, some stripes may be mostly empty and some full. For a
> more complete explanation of stripe assignments, see
> http://arch.lustre.org/index.php?title=Feature_Free_Space_Management
> [...]
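[To check whether the free-space spread has actually crossed the 20% threshold, the per-OST usage can be read from a client. A sketch, assuming `lfs df` is available in this 1.6 build and the usual /proc layout of that era:

    # per-OST capacity and usage as seen by the client
    lfs df /mnt/test

    # raw per-OSC free-space counters (kilobytes available per OST)
    for f in /proc/fs/lustre/osc/*/kbytesavail; do
        echo "$f: `cat $f`"
    done
]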
Mirko Benz wrote:
> Hi Nathaniel,
>
> When using 6 clients (IOR test) with 6 OSTs on a single OSS -- why
> does Lustre use only 5 OSTs (4 OSTs with 1 client each, 1 OST with
> two clients, 1 empty OST)? The FS is empty -- round robin should be
> used. The OST size is 100 GB and the file size is 5 GB -- so the free
> space cannot differ by more than 20%.

Here is a quick local test:

    OSTCOUNT=6 sh llmount.sh
    cat /proc/fs/lustre/lov/lustre-mdtlov/target_obd
    for FILE in `seq 1 60`; do cp /etc/termcap /mnt/lustre/file$FILE; done
    ../utils/lfs getstripe /mnt/lustre | grep 0x

The stripe lines are in the form obdidx, objid, objid (hex), group:

    0   2   0x2   0
    1   2   0x2   0
    2   2   0x2   0
    3   2   0x2   0
    4   2   0x2   0
    5   2   0x2   0
    0   3   0x3   0
    2   3   0x3   0
    3   3   0x3   0
    4   3   0x3   0
    5   3   0x3   0
    0   4   0x4   0
    1   3   0x3   0
    3   4   0x4   0
    4   4   0x4   0
    5   4   0x4   0
    0   5   0x5   0
    1   4   0x4   0
    2   4   0x4   0
    4   5   0x5   0
    5   5   0x5   0
    0   6   0x6   0
    1   5   0x5   0
    2   5   0x5   0
    3   5   0x5   0
    5   6   0x6   0
    0   7   0x7   0
    1   6   0x6   0
    2   6   0x6   0
    3   6   0x6   0
    4   6   0x6   0
    0   8   0x8   0
    1   7   0x7   0
    2   7   0x7   0
    3   7   0x7   0
    4   7   0x7   0
    5   7   0x7   0
    1   8   0x8   0
    2   8   0x8   0
    3   8   0x8   0
    4   8   0x8   0
    5   8   0x8   0
    0   9   0x9   0
    2   9   0x9   0
    3   9   0x9   0
    4   9   0x9   0
    5   9   0x9   0
    0   10  0xa   0
    1   9   0x9   0
    3   10  0xa   0
    4   10  0xa   0
    5   10  0xa   0
    0   11  0xb   0
    1   10  0xa   0
    2   10  0xa   0
    4   11  0xb   0
    5   11  0xb   0
    0   12  0xc   0
    1   11  0xb   0
    2   11  0xb   0

You can see from these results:

1. The object count on each OST is roughly the same: 10 objects each,
   starting with objid 0x2 and ending with 0xb.
2. The objects are created in OST order (0-5).
3. After every ostcount+1 objects we skip an OST. This causes our
   "starting point" to precess around, eliminating some degenerate
   cases where applications with very regular file creation/striping
   patterns would have preferentially used a particular OST in the
   sequence.

I can only suggest that if you want very fine control over where files
are placed, you use the 'lfs setstripe' command and set explicit
starting OSTs. If you have a simple reproducer I would be glad to look
at the results.
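[A minimal sketch of the manual placement suggested above, again assuming the positional setstripe arguments (stripe size, starting OST index, stripe count); the file names and counts are illustrative. Pre-creating each file with an explicit starting OST guarantees a perfectly even layout regardless of what the allocator would have chosen:

    # pin file N to OST N mod 6 as a single-stripe file, so each of
    # the 6 OSTs receives exactly two of the twelve files
    for N in `seq 0 11`; do
        lfs setstripe /mnt/test/file$N 0 $(($N % 6)) 1
    done

    # confirm the placement
    lfs getstripe /mnt/test/file* | grep 0x
]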