Hi ...,

I have a Lustre storage system that stores lots of small files, i.e. hundreds to thousands of 9MB image files.
While normal file access works fine, the directory listing is extremely slow.
Depending on the number of files in a directory, the listing takes around 5 - 15 secs.

I tried 'ls --color=none' and it worked fine; listed the contents immediately.

But that doesn't help my cause. I have Samba gateway servers, and all users access the storage through the gateway. Double clicking on a directory takes a long, long time to display.

The cluster consists of -
- two DRBD-mirrored MDS servers (Dell R610s) with 10K RPM disks
- four OSS nodes (2-node clusters (Dell R710s) with common storage (Dell MD3200))

The storage consists of 12 x 1TB HDDs on both arrays, in a RAID 6 configuration.

What actually happens when one does a listing like this?
What can I do to make the listing faster?
Could it be an MDS issue?
Some site suggested that this could be caused by the '-o flock' mount option. Is it so?

Kindly help.
The storage is in production, and this is causing a lot of issues.

Regards,

Indivar Nair
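A quick way to see how much of the time goes into the per-file stat is to time the two listings directly on the Lustre mount; the directory path below is only a placeholder, not one from the original post:

    time ls --color=none   /lustre/images/some_dir > /dev/null
    time ls --color=always /lustre/images/some_dir > /dev/null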
> While normal file access works fine, the directory listing is extremely slow.
> Depending on the number of files in a directory, the listing takes around 5 - 15 secs.
>
> I tried 'ls --color=none' and it worked fine; listed the contents immediately.

That's because 'ls --color=always|auto' does an lstat() of each file (--color=none doesn't), which causes Lustre to send:

 - 1 RPC to the MDS per file
 - 1 RPC (per file) to EACH OSS where the file is stored, to get the file size

Some time ago I created a patch to speed up 'ls' while keeping (most) of the colors:
https://github.com/adrian-bl/patchwork/blob/master/coreutils/ls/ls-lustre.diff

But patching Samba will not be possible in your case, as it really needs the information returned by stat().

> Double clicking on a directory takes a long, long time to display.

Attach `strace' to Samba: it will probably be busy doing lstat(), which is a 'slow' operation on Lustre in any case.

> The cluster consists of -
> - two DRBD-mirrored MDS servers (Dell R610s) with 10K RPM disks
> - four OSS nodes (2-node clusters (Dell R710s) with common storage (Dell MD3200))

How many OSTs do you have per OSS?
What's your stripe setting? Setting the stripe count to 1 could give you a huge speedup (without affecting normal I/O, as I assume the 9MB files are read/written sequentially).

Regards,
 Adrian

--
RFC 1925:
 (11) Every old idea will be proposed again with a different name and
      a different presentation, regardless of whether it works.
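For reference, the stripe layout Adrian asks about and the syscall pattern he suggests tracing can both be checked with standard tools; the path and the smbd PID below are placeholders, not values from this thread:

    # Show the (default) stripe layout of a directory
    lfs getstripe -d /lustre/images

    # Make new files created in that directory use a single stripe
    lfs setstripe -c 1 /lustre/images

    # Watch what a Samba process does while a client browses the share
    strace -f -tt -e trace=stat,lstat,getxattr -p <smbd_pid>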
I have 1 OST per OSS.
I have set the stripe to the Lustre default. Should I still change the stripe count to 1?
Yes, the reads are sequential.

1. Will increasing / decreasing the number of max_rpcs_in_flight help? It is set to 32 now.

2. Just for experimenting, would it speed up listing if 1 OSS served 2 OSTs, thereby reducing the number of OSSs and RPCs?

Regards,

Indivar Nair

On Wed, Sep 7, 2011 at 1:28 AM, Adrian Ulrich <adrian at blinkenlights.ch> wrote:

> How many OSTs do you have per OSS?
> What's your stripe setting? Setting the stripe count to 1 could give you a huge
> speedup (without affecting normal I/O, as I assume the 9MB files are
> read/written sequentially).
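For completeness, max_rpcs_in_flight is a per-OSC tunable on the client and can be inspected or changed with lctl; the value shown is just the current default being set again, not a recommendation from this thread:

    # On a Lustre client: current setting per OSC
    lctl get_param osc.*.max_rpcs_in_flight

    # Change it (takes effect immediately, not persistent across remounts)
    lctl set_param osc.*.max_rpcs_in_flight=32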
You'll see a small benefit from mdc.*.statahead_max, which will grab most of the information for the stat from the MDT, but you still pay a penalty because the glimpse for each OSC will be synchronous (but issued simultaneously to each OSC), IIRC. There were some patches mentioned quite a while back for asynchronous glimpses, but I didn't see much of a difference and never tried to narrow down why that was. It also wasn't production ready.

Hopefully you have extended attributes disabled for your Samba servers as well, because it could be much worse if you have to listxattr and then fetch each one. Every file in Lustre also has a trusted.lov or lustre.lov xattr that stores the striping information, so for every file you'd be doing a listxattr and a getxattr. Quite a while back I wrote a Samba module to filter out Lustre striping xattrs but not others, which helped a little.

The last thing, which doesn't sound like it is affecting you, is how Lustre clients read directory pages. Today in 1.8 and 2.0 each directory is read one page at a time (4k on x86_64). If you have hundreds of thousands of files in your directories, they will be larger than a single page. There are some patches already done by Whamcloud (http://jira.whamcloud.com/browse/LU-5) that let the directory readpage read ahead a full bulk RPC, which is 1 MB. Depending on your environment this could be almost a 256x speedup. This had a huge impact for me, but only seems significant if you're not doing a stat on every file.

Jeremy

On Tue, Sep 6, 2011 at 8:21 PM, Indivar Nair <indivar.nair at techterra.in> wrote:

> I have 1 OST per OSS.
> I have set the stripe to the Lustre default. Should I still change the stripe count to 1?
> Yes, the reads are sequential.
>
> 1. Will increasing / decreasing the number of max_rpcs_in_flight help? It is set to 32 now.
>
> 2. Just for experimenting, would it speed up listing if 1 OSS served 2 OSTs, thereby reducing the number of OSSs and RPCs?
The statahead_max is actually at llite.*.statahead_max, not mdc.

Jeremy

On Tue, Sep 6, 2011 at 10:13 PM, Jeremy Filizetti <jeremy.filizetti at gmail.com> wrote:

> You'll see a small benefit from mdc.*.statahead_max, which will grab most of
> the information for the stat from the MDT, but you still pay a penalty
> because the glimpse for each OSC will be synchronous (but issued
> simultaneously to each OSC), IIRC.
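A minimal sketch of how that client-side parameter is inspected and changed; the value 128 is only an example, not a figure taken from this thread:

    # On a Lustre client
    lctl get_param llite.*.statahead_max
    lctl set_param llite.*.statahead_max=128

    # Hit/miss counters to judge whether statahead is actually helping
    lctl get_param llite.*.statahead_stats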
Another thing to try is setting vfs_cache_pressure=0 on the OSSes and periodically setting it to nonzero to reclaim memory. More details here:

http://www.olcf.ornl.gov/wp-content/events/lug2011/4-14-2011/830-900_Robin_Humble_rjh.lug2011.pdf

-mb

On Sep 6, 2011, at 2:43 AM, Indivar Nair wrote:

> I have a Lustre storage system that stores lots of small files, i.e. hundreds to
> thousands of 9MB image files.
> While normal file access works fine, the directory listing is extremely slow.
> Depending on the number of files in a directory, the listing takes around 5 - 15 secs.

--
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| Scientific Computing Group
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------
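The knob Michael mentions is the ordinary Linux VM sysctl; one way to apply it on an OSS could look like the following (the non-zero value used for the periodic reclaim is a common default, not something prescribed by the slides):

    # On each OSS: stop the kernel from reclaiming dentry/inode caches
    sysctl -w vm.vfs_cache_pressure=0

    # Periodically (e.g. from cron) allow reclaim again for a while
    sysctl -w vm.vfs_cache_pressure=100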
Hi ...,

Thanks for the inputs Adrian, Jeremy, Michael.

From Adrian's explanation, I gather that the OSC generates 1 RPC to each OST for each file. Since there is only 1 OST in each of the 4 OSSs, we only get 128 simultaneous RPCs. So listing 2000 files would only get us that much speed, right?

Now, each of the OSTs is around 4.5 TB in size. So say we reduce the disk size to 1.125 TB, but increase the number to 4; then we would get 4 OSTs x 32 RPCs = 128 RPC connections to each OSS, and 512 simultaneous RPCs across the Lustre storage. Wouldn't this increase the listing speed four times over?

Currently, we have around 12GB of RAM on each OSS. I believe we will have to increase this to accommodate the extra 3 OSTs, and another 4 OSTs in case of failover. We will also require a proportionate increase in MDS RAM too.

Is my theory right? Is there any catch to it?

Also, can 1 RPC consume more than 1 I/O thread, say, reading from the buffer of one I/O and then moving to the next I/O buffer? Or is it strictly 1 RPC = 1 I/O?

Regards,

Indivar Nair

On Wed, Sep 7, 2011 at 6:40 PM, Michael Barnes <Michael.Barnes at jlab.org> wrote:

> Another thing to try is setting vfs_cache_pressure=0 on the OSSes and
> periodically setting it to nonzero to reclaim memory.
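As a side note, the current OST layout and sizes being discussed here can be confirmed from any client; both commands below are standard Lustre tools:

    # Per-OST capacity and usage
    lfs df -h

    # List of configured devices (MDC, OSCs, etc.) as seen from this node
    lctl dl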
> From Adrian's explanation, I gather that the OSC generates 1 RPC to each
> OST for each file. Since there is only 1 OST in each of the 4 OSSs, we only
> get 128 simultaneous RPCs. So listing 2000 files would only get us that much
> speed, right?

There is no concurrency in fetching these attributes, because the program that issues the lstat/stat/fstat from userspace is only inquiring about a single file at a time. So every RPC becomes a minimum of one round-trip-time network latency between the client and an OSS, assuming the statahead thread fetched the MDS attributes and the OSS has cached inode structures (ignoring a few other small additions). So if you have 2000 files in a directory and you had an average network latency of 150 us for a glimpse RPC (which I've seen for cached inodes on the OSS), you have a best case of 2000 * 0.000150 = 0.3 seconds. Without cached inodes, disk latency on the OSS will make that time far longer and less predictable.

> Now, each of the OSTs is around 4.5 TB in size. So say we reduce the disk
> size to 1.125 TB, but increase the number to 4; then we would get
> 4 OSTs x 32 RPCs = 128 RPC connections to each OSS, and 512 simultaneous RPCs
> across the Lustre storage. Wouldn't this increase the listing speed four
> times over?

The only hope for speeding this up is probably a code change to implement an async glimpse thread or bulkstat/readdirplus, where Lustre could fetch attributes before userspace requests them so they would be locally cached.

Jeremy
So what you are saying is: the OSC (stat) issues the 1st RPC to the 1st OST, waits for its response, then issues the 2nd RPC to the 2nd OST, and so on and so forth till it 'stat's all the 2000 files. That would be slow :-(.

Why does Lustre do it this way, while everywhere else it tries to do extreme parallelization?
Would patching lstat/stat/fstat to parallelize requests only when accessing a Lustre store be possible?

Regards,

Indivar Nair

On Mon, Sep 12, 2011 at 6:38 AM, Jeremy Filizetti <jeremy.filizetti at gmail.com> wrote:

> There is no concurrency in fetching these attributes, because the program
> that issues the lstat/stat/fstat from userspace is only inquiring about a
> single file at a time.
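Not something suggested by the posters here, but one userspace workaround along the lines Indivar is asking about is to warm the attribute caches by issuing the stats in parallel before users browse the directory; treat this as an untested sketch with a placeholder path:

    # Prime client-side attribute caches for one directory, 32 stats at a time
    find /lustre/images/some_dir -maxdepth 1 -print0 \
        | xargs -0 -P 32 -n 64 stat > /dev/null 2>&1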
Sorry, didn't get it the first time -
You say 'the program that issues the lstat/stat/fstat from userspace is only inquiring about a single file at a time'. By 'the program' you mean 'samba' or 'ls' in our case.
Or is it the 'Windows Explorer' that is triggering a 'stat' on each file on the Samba server?

Regards,

Indivar Nair

On Mon, Sep 12, 2011 at 9:09 AM, Indivar Nair <indivar.nair at techterra.in> wrote:

> So what you are saying is: the OSC (stat) issues the 1st RPC to the 1st
> OST, waits for its response, then issues the 2nd RPC to the 2nd OST, and so on
> and so forth till it 'stat's all the 2000 files. That would be slow :-(.
On Mon, Sep 12, 2011 at 6:38 AM, Jeremy Filizetti <jeremy.filizetti at gmail.com> wrote:

> The only hope for speeding this up is probably a code change to implement
> an async glimpse thread or bulkstat/readdirplus, where Lustre could fetch
> attributes before userspace requests them so they would be locally cached.

If possible, it would be better to have pattern-based prefetching of attributes, as is done for file accesses in Lustre.

--
Rishi Pathak
So this is how the flow is -

Windows Explorer requests statistics of a single file in the dir -> Samba initiates a 'stat' call on the file -> MDC initiates an RPC request to the MDT; gets a response -> OSC initiates an RPC to the OST; gets a response -> Response given back to stat / Samba -> Samba sends the statistics back to Explorer.

Hmmm..., doing this 2000 times is going to take a long time.
And there is no way we can fix Explorer to do a bulk stat request :-(.

So the only option is to get Lustre to respond faster to individual requests.
Is there any way to increase the size and TTL of the file metadata cache on the MDTs and OSTs?
And how does the patch work then? If a request is for only 1 file stat, how does reading multiple pages in readdir() help?

Regards,

Indivar Nair

On Mon, Sep 12, 2011 at 10:31 AM, Jeremy Filizetti <jeremy.filizetti at gmail.com> wrote:

> On Sep 12, 2011 12:27 AM, "Indivar Nair" <indivar.nair at techterra.in> wrote:
> >
> > Sorry, didn't get it the first time -
> > You say 'the program that issues the lstat/stat/fstat from userspace is
> > only inquiring about a single file at a time'. By 'the program' you mean
> > 'samba' or 'ls' in our case.
>
> Correct, those programs are issuing the syscalls.
>
> > Or is it the 'Windows Explorer' that is triggering a 'stat' on each
> > file on the Samba server?
>
> Win Explorer is sending the requests to Samba, and Samba just issues the
> syscall to retrieve the information and send it back.
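On the metadata cache question: Lustre does not expose a single size/TTL knob, but client-side attribute caching is tied to the DLM lock LRU, which does have tunables; the values below are illustrative only, and note that the lru_max_age units differ between Lustre versions (milliseconds in older releases, seconds in newer ones), so check the default first:

    # On a Lustre client: current lock LRU limits per namespace
    lctl get_param ldlm.namespaces.*.lru_size
    lctl get_param ldlm.namespaces.*.lru_max_age

    # Keep more locks (and the attributes they cover) cached, for longer
    lctl set_param ldlm.namespaces.*.lru_size=4096
    lctl set_param ldlm.namespaces.*.lru_max_age=600000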
Okay, I was going through the whole mail trail again and got answers to my last set of questions from your (Jeremy's) earlier mail -

And how does the patch work then? If a request is for only 1 file stat, how does reading multiple pages in readdir() help?

Ans: 'The only hope for speeding this up is probably a code change to implement an async glimpse thread or bulkstat/readdirplus, where Lustre could fetch attributes before userspace requests them so they would be locally cached.'

Plus setting 'vfs_cache_pressure=0', or to a very low number, is the solution.

Thanks,
Regards,

Indivar Nair

On Mon, Sep 12, 2011 at 11:07 AM, Indivar Nair <indivar.nair at techterra.in> wrote:

> So this is how the flow is -
>
> Windows Explorer requests statistics of a single file in the dir -> Samba
> initiates a 'stat' call on the file -> MDC initiates an RPC request to the MDT;
> gets a response -> OSC initiates an RPC to the OST; gets a response -> Response
> given back to stat / Samba -> Samba sends the statistics back to Explorer.