I've closely followed the metadata mailing list posts over the last year. We've been running our small filesystem for a couple of months in semi-production mode. We don't have a traditional HPC workload (it's big image files with 5-10 small xml files) and we knew that lustre didn't excel at small files.

I ordered the beefiest MDS I could (quad proc dual core opterons with 16GB ram) and put it on the fastest array I could (14 drive raid 10 with 15k rpm disks). Still, as always, I'm wondering if I can do better.

Everything runs over tcp/ip with no jumbo frames. My standard test is to simply track how many opens we do every 10 seconds. I run the following command and keep track of the results. We never really exceed ~2000 opens/second. Our workload often involves downloading ~50000 small (4-10k) xml files as fast as possible.

I'm just interested in what other lustre gurus have to say about my results. I've tested different lru_size amounts (makes little difference) and portals debug is off. My understanding is that the biggest performance increase I would see is moving to infiniband instead of tcp interconnects.

Thanks,

[root@lu-mds01 lustre]# echo 0 > /proc/fs/lustre/mds/lustre1-MDT0000/stats; sleep 10; cat /proc/fs/lustre/mds/lustre1-MDT0000/stats
snapshot_time             1179180513.905326 secs.usecs
open                      14948 samples [reqs]
close                     7456 samples [reqs]
mknod                     0 samples [reqs]
link                      0 samples [reqs]
unlink                    110 samples [reqs]
mkdir                     0 samples [reqs]
rmdir                     0 samples [reqs]
rename                    99 samples [reqs]
getxattr                  0 samples [reqs]
setxattr                  0 samples [reqs]
iocontrol                 0 samples [reqs]
get_info                  0 samples [reqs]
set_info_async            0 samples [reqs]
attach                    0 samples [reqs]
detach                    0 samples [reqs]
setup                     0 samples [reqs]
precleanup                0 samples [reqs]
cleanup                   0 samples [reqs]
process_config            0 samples [reqs]
postrecov                 0 samples [reqs]
add_conn                  0 samples [reqs]
del_conn                  0 samples [reqs]
connect                   0 samples [reqs]
reconnect                 0 samples [reqs]
disconnect                0 samples [reqs]
statfs                    27 samples [reqs]
statfs_async              0 samples [reqs]
packmd                    0 samples [reqs]
unpackmd                  0 samples [reqs]
checkmd                   0 samples [reqs]
preallocate               0 samples [reqs]
create                    0 samples [reqs]
destroy                   0 samples [reqs]
setattr                   389 samples [reqs]
setattr_async             0 samples [reqs]
getattr                   3467 samples [reqs]
getattr_async             0 samples [reqs]
brw                       0 samples [reqs]
brw_async                 0 samples [reqs]
prep_async_page           0 samples [reqs]
queue_async_io            0 samples [reqs]
queue_group_io            0 samples [reqs]
trigger_group_io          0 samples [reqs]
set_async_flags           0 samples [reqs]
teardown_async_page       0 samples [reqs]
merge_lvb                 0 samples [reqs]
adjust_kms                0 samples [reqs]
punch                     0 samples [reqs]
sync                      0 samples [reqs]
migrate                   0 samples [reqs]
copy                      0 samples [reqs]
iterate                   0 samples [reqs]
preprw                    0 samples [reqs]
commitrw                  0 samples [reqs]
enqueue                   0 samples [reqs]
match                     0 samples [reqs]
change_cbdata             0 samples [reqs]
cancel                    0 samples [reqs]
cancel_unused             0 samples [reqs]
join_lru                  0 samples [reqs]
init_export               0 samples [reqs]
destroy_export            0 samples [reqs]
extent_calc               0 samples [reqs]
llog_init                 0 samples [reqs]
llog_finish               0 samples [reqs]
pin                       0 samples [reqs]
unpin                     0 samples [reqs]
import_event              0 samples [reqs]
notify                    0 samples [reqs]
health_check              0 samples [reqs]
quotacheck                0 samples [reqs]
quotactl                  0 samples [reqs]
ping                      0 samples [reqs]

--
Daniel Leaberry
Systems Administrator
iArchives
Tel: 801-494-6528
Cell: 801-376-6411
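For anyone wanting to repeat this measurement over time, a minimal sketch of the same sampling approach, wrapped in a loop that reports opens per second each interval. It assumes the 1.4.x proc layout and MDT name shown above; adjust the path for your own setup.

    #!/bin/bash
    # Minimal sketch: reset the MDS counters, wait, then report opens/sec.
    # Assumes the stats file layout shown above; adjust the MDT name as needed.
    STATS=/proc/fs/lustre/mds/lustre1-MDT0000/stats
    INTERVAL=10

    while true; do
        echo 0 > "$STATS"                       # zero all counters
        sleep "$INTERVAL"
        opens=$(awk '$1 == "open" {print $2}' "$STATS")
        echo "$(date '+%H:%M:%S')  opens/sec: $((opens / INTERVAL))"
    done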
On May 14, 2007 16:17 -0600, Daniel Leaberry wrote:
> I ordered the beefiest MDS I could (quad proc dual core opterons with
> 16GB ram) and put it on the fastest array I could (14 drive raid 10 with
> 15k rpm disks). Still, as always, I'm wondering if I can do better.

You should get a noticeable boost when moving to 1.6. It has much better internal locking for the DLM to scale on many-core systems.

> Our workload often involves downloading ~50000 small
> (4-10k) xml files as fast as possible.

How many OSTs do you have, and what is the OSC precreate count on the MDS? If the MDS isn't precreating at least 50000 objects in advance of your "burst" it may be waiting for objects from the OSTs.

> My understanding is that the
> biggest performance increase I would see is moving to infiniband instead
> of tcp interconnects.

Yes, that would likely be a noticeable win. The TCP transaction rate is limited to about 5k/sec, while IB can do 20k/sec or more. You might also get a boost from having multiple TCP connections into the MDS.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
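One rough way to see how far ahead the MDS has precreated objects for each OST, along the lines Andreas suggests checking, is to compare prealloc_last_id against prealloc_next_id for each OSC on the MDS. This is only a sketch under the assumption that this Lustre release exposes those prealloc_* proc files; names and paths may differ between versions.

    #!/bin/bash
    # Rough sketch (run on the MDS): show how many precreated objects are
    # still available for each OST. Assumes prealloc_last_id/prealloc_next_id
    # exist under the OSC proc entries on this release.
    for osc in /proc/fs/lustre/osc/*/; do
        [ -f "$osc/prealloc_last_id" ] || continue
        last=$(cat "$osc/prealloc_last_id")
        next=$(cat "$osc/prealloc_next_id")
        echo "$(basename "$osc"): $((last - next)) objects precreated"
    done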
Daniel,

I've attached a spreadsheet containing data from a lustre create test which I ran some time ago. The purpose of the test was to determine how different hardware configs affected create performance. As you'll see from the data, the ost is actually the slowest component in the create chain. I tested several OST and MDS configs and found that every disk-based OST configuration was susceptible to lengthy operation times interspersed throughout the test. This periodic slowness was correlated with disk activity on the OST - at the time I suspected that the activity was on behalf of the journal. Moving the entire OST onto a ramdisk increased the performance substantially.

What I did not try was moving only the journal into a ramdisk. It's possible that this will decrease the frequency of the OST's slow operations. If that is the case, you may be able to increase your create performance by purchasing a solid-state device for your ost journals.

There are also some numbers for IB (openibnld) included in the spreadsheet. I found that using IB lowered the operation time by 100-200 usecs, so it's true that switching to IB will speed things up.

I've also attached the raw data from Test1 (config1 and config13). Each line of raw data is the operation latency in seconds; operation number == line number.

paul
[Attachments:
 Lustre_Metadata_Create_Test.csv (8726 bytes) - http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20070515/79f81225/Lustre_Metadata_Create_Test-0001.bin
 config_1-1pe_create.dat.bz2 (138603 bytes) - http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20070515/79f81225/config_1-1pe_create.dat-0001.bin
 config_13-1pe_create.dat.bz2 (148722 bytes) - http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20070515/79f81225/config_13-1pe_create.dat-0001.bin]
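On the external-journal idea above: an ext3/ldiskfs OST journal can be moved to a separate device with standard e2fsprogs tools. An untested sketch, with placeholder device names, to be run only while the OST is stopped and unmounted:

    # Untested sketch: move an ext3/ldiskfs OST journal to a separate
    # (e.g. solid-state) device. Device names are placeholders; the OST
    # filesystem must be unmounted first.
    JOURNAL_DEV=/dev/sdX    # small, fast device to hold the external journal
    OST_DEV=/dev/sdY        # existing OST block device

    mke2fs -b 4096 -O journal_dev "$JOURNAL_DEV"    # block size must match the OST
    tune2fs -O ^has_journal "$OST_DEV"              # drop the internal journal
    tune2fs -j -J device="$JOURNAL_DEV" "$OST_DEV"  # attach the external journal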
Daniel,

We had similar findings with PSC. Moving from TCP to IB (OFED 1.1) client-to-server connections almost doubled the metadata ops performance for single client cases.

Sarp

-----------------
Sarp Oral, Ph.D.
NCCS ORNL
865-574-2173
On May 15, 2007 13:03 -0400, pauln wrote:
> I've attached a spreadsheet containing data from a lustre create test
> which I ran some time ago. The purpose of the test was to determine how
> different hardware configs affected create performance. As you'll see
> from the data, the ost is actually the slowest component in the create
> chain. I tested several OST and MDS configs and found that every
> disk-based OST configuration was susceptible to lengthy operation times
> interspersed throughout the test. This periodic slowness was correlated
> with disk activity on the OST - at the time I suspected that the
> activity was on behalf of the journal. Moving the entire OST onto a
> ramdisk increased the performance substantially.

Paul,
what version of Lustre were you testing? How large are the ext3 inodes on the OSTs (can be seen with "dumpe2fs -h /dev/{ostdev}")? What is the default stripe count?

If you are running 1.4.6 and the ext3 inode size is 128 bytes then there can be a significant performance hit due to extra metadata being stored on the OSTs. This is not an issue with filesystems using a newer Lustre.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Andreas Dilger wrote:
> Paul,
> what version of Lustre were you testing? How large are the ext3 inodes on
> the OSTs (can be seen with "dumpe2fs -h /dev/{ostdev}")? What is the
> default stripe count?
>
> If you are running 1.4.6 and the ext3 inode size is 128 bytes then there
> can be a significant performance hit due to extra metadata being stored
> on the OSTs. This is not an issue with filesystems using a newer Lustre.

The lustre version was 1.4.6.1 (rhel kernel 2.6.9-34). I used the default inode size and only had 1 OST. Can you briefly describe the problems with 128 byte inodes and suggest a more optimal size?

thanks,
paul
On May 15, 2007 16:03 -0400, pauln wrote:
> Andreas Dilger wrote:
> > what version of Lustre were you testing? How large are the ext3 inodes on
> > the OSTs (can be seen with "dumpe2fs -h /dev/{ostdev}")? What is the
> > default stripe count?
> >
> > If you are running 1.4.6 and the ext3 inode size is 128 bytes then there
> > can be a significant performance hit due to extra metadata being stored
> > on the OSTs. This is not an issue with filesystems using a newer Lustre.
>
> The lustre version was 1.4.6.1 (rhel kernel 2.6.9-34). I used the
> default inode size and only had 1 OST. Can you briefly describe the
> problems with 128 byte inodes and suggest a more optimal size?

In 1.4.6 a feature was added to store extra EA data with each OST object to allow object recovery in the face of serious data corruption. If the directory that references an object is corrupted, the data may still be present on disk but is not identifiable. Each OST object now keeps the object ID, along with the MDS inode number and stripe index.

It was identified that this can cause a slowdown during rapid object creates, but for most normal uses this is not noticeable because there are multiple OSTs per filesystem, and the MDS precreates sufficient objects to avoid a slowdown. In 1.4.7 the default OST inode size was increased to 256 bytes to avoid the slowdown.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
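For reference, checking the OST inode size that Andreas mentions looks roughly like the sketch below; the device name is a placeholder, and reformatting with a larger inode size destroys existing data, so that step only applies to a new or rebuilt OST.

    # Check the current inode size on an OST device (placeholder name):
    dumpe2fs -h /dev/sdY 2>/dev/null | grep 'Inode size'

    # When building a new OST with older tools, 256-byte inodes can be
    # requested at format time by passing "-I 256" through the mkfs
    # options used to create the OST filesystem (1.4.7+ defaults to this).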