thr3ads.net - Lustre devel - [Lustre-devel] Erratum about indexes in robinhood DB [Oct 2011]

Eric Barton

2011-Oct-05 13:03 UTC

[Lustre-devel] Erratum about indexes in robinhood DB

Thomas,

Thanks a lot, and I hope you don't mind me cc-ing lustre-devel as this
seems to be of general interest.

Do you have a feel (or measurements :) for the rate at which a changelog
can be ingested into robinhood?  And I'm wondering about DNE and multiple
changelogs coming from multiple MDTs.  I'd be very interested to know if
you've thought about this and have views on what the maximum ingest rate
could be and whether there will be issues coordinating/merging events
across multiple feeds.

          Cheers,
                   Eric

Eric Barton
CTO Whamcloud, Inc.
> -----Original Message-----
> From: LEIBOVICI Thomas [mailto:thomas.leibovici at cea.fr]
> Sent: 29 September 2011 10:04 AM
> To: Eric Barton
> Subject: Erratum about indexes in robinhood DB
> 
> Hello Eric,
> 
> Re-thinking about your question on indexes in robinhood DB, my answer
> was incomplete.
> Actually, there are indexes on user/group/type/status, but they are not
> on the main table:
> 
> 1) As I told you, on the main table (the one that lists all FS entries),
> there are as few indexes as possible (just fid as primary key, and
> parent fid)
> in order to preserve a good insert/update rate on this table regardless of
> the FS size (the deeper the DB index trees, the slower those requests).
> 
> 2) There is a secondary table where robinhood maintains aggregated
> statistics, like number of entries and volume per user/group/type/(hsm)status,
> which is updated on the fly.
> This one has indexes on almost all of its fields, which makes it possible to
> get instantaneous stats per user, etc. without penalizing the insert/update
> rate on the main table.
> Indexes on this secondary table are less expensive, given that the set
> of users is much more restricted than the number of entries.
> 
> This time you have a more complete answer.
> 
> Best regards
> Thomas
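
The two-table layout described in the quoted message can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module, not robinhood's actual schema: the table names (`entries`, `acct_stats`), the column names and the helper functions are all assumptions made up for the example. The main table carries just the fid primary key and a parent-fid index, while the small aggregate table is indexed freely and updated in the same transaction as each insert.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not robinhood's real ones.
SCHEMA = """
-- Main table: one row per filesystem entry, kept light on indexes.
CREATE TABLE entries (
    fid        TEXT PRIMARY KEY,   -- index 1: primary key
    parent_fid TEXT,
    owner      TEXT,
    grp        TEXT,
    ftype      TEXT,
    size       INTEGER
);
CREATE INDEX entries_parent ON entries(parent_fid);  -- index 2

-- Secondary table: one row per (owner, grp, ftype), so it stays small
-- and indexing most of its fields is cheap.
CREATE TABLE acct_stats (
    owner      TEXT,
    grp        TEXT,
    ftype      TEXT,
    nb_entries INTEGER NOT NULL,
    volume     INTEGER NOT NULL,
    PRIMARY KEY (owner, grp, ftype)
);
CREATE INDEX acct_grp ON acct_stats(grp);
CREATE INDEX acct_ftype ON acct_stats(ftype);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_entry(conn, fid, parent_fid, owner, grp, ftype, size):
    """Insert one entry and update its aggregate row in the same transaction."""
    with conn:
        conn.execute(
            "INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
            (fid, parent_fid, owner, grp, ftype, size),
        )
        cur = conn.execute(
            "UPDATE acct_stats SET nb_entries = nb_entries + 1, "
            "volume = volume + ? "
            "WHERE owner = ? AND grp = ? AND ftype = ?",
            (size, owner, grp, ftype),
        )
        if cur.rowcount == 0:
            # First entry for this (owner, grp, ftype) combination.
            conn.execute(
                "INSERT INTO acct_stats VALUES (?, ?, ?, 1, ?)",
                (owner, grp, ftype, size),
            )


def stats_for_user(conn, owner):
    """Return (entry count, total volume) for a user without scanning `entries`."""
    return conn.execute(
        "SELECT COALESCE(SUM(nb_entries), 0), COALESCE(SUM(volume), 0) "
        "FROM acct_stats WHERE owner = ?",
        (owner,),
    ).fetchone()
```

Per-user statistics then come from a lookup in the small aggregate table rather than from a scan of the main table, which is the trade-off the message describes.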

LEIBOVICI Thomas

2011-Oct-06 11:01 UTC

Hello Eric,

With a fast enough feeder, the ingest rate robinhood can currently
sustain is between 50,000/sec and 100,000/sec
(depending on the insert/update/remove ratio) with a basic MySQL DB stored
on a local disk.
This can certainly still be improved with MySQL tuning and/or better HW
and/or an enterprise-class DB,
but for now, we find it is easily high enough for reading an MDT
changelog stream on a petaflop-scale system.
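
To put the quoted range in perspective, here is a small back-of-the-envelope sketch. The backlog size used in the usage note is a purely hypothetical figure chosen for illustration, and the function name is made up; only the two rates come from the message above.

```python
def drain_seconds(backlog_records: int, rate_per_sec: float) -> float:
    """Time needed to ingest a backlog at a constant sustained rate."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    return backlog_records / rate_per_sec


def drain_hours(backlog_records: int, rate_per_sec: float) -> float:
    """Same as drain_seconds, expressed in hours."""
    return drain_seconds(backlog_records, rate_per_sec) / 3600.0
```

For a hypothetical backlog of one billion records, `drain_seconds(1_000_000_000, 50_000)` gives 20,000 seconds (about 5.6 hours) and `drain_seconds(1_000_000_000, 100_000)` gives 10,000 seconds (about 2.8 hours), assuming the feeder keeps up and the rate stays constant.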

This rate is actually lower when processing Lustre MDT changelogs (but I
have no measurement) because of the "stat" operations needed to get file
attributes
(unfortunately, changelogs do not give the new value of what has just
changed, e.g. the new uid for a chown operation, or the new size and mtime
for an mtime event...).
SOM will probably improve that point, but it could be a good idea to add
more info to changelogs.
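
The extra cost described above can be illustrated with a minimal sketch. The `ChangeRecord` type and its fields are hypothetical stand-ins, not the real Lustre changelog record format (which identifies entries by fid rather than by path); the point is only that a record saying "something changed" forces the consumer to go back to the filesystem for the new values.

```python
import os
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    """Hypothetical record: says what kind of change happened, not the new value."""
    event: str   # e.g. "chown" or "mtime" (illustrative labels)
    path: str    # assumption: a path is available; real records carry a fid


def refresh_attrs(record: ChangeRecord) -> dict:
    """Fetch current attributes with one extra stat call per record.

    On a networked filesystem this call is a round trip to the servers,
    which is why it lowers the achievable ingest rate.
    """
    st = os.lstat(record.path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "mtime": st.st_mtime,
    }
```

If the record itself carried the new uid, size or mtime, `refresh_attrs` could be skipped for those events, which is the improvement being suggested.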

Handling changelogs from multiple MDTs is indeed a very interesting point
to address.
The main issue is the database scaling in terms of operation rate,
volume and entry location.
A solution could be to use an existing clustered DB engine (MySQL
Cluster, NoSQL DBs...),
so we are going to take a look at the different alternatives and see
if they could match the need.
For that, it would be interesting to know how records will be split
across the multiple changelog streams:
is a given fid always reported by the same stream? What about the parent
fid (as in create/unlink operations)?
If you have a document about the DNE design, I think it would give a more
precise idea of
which events and fids are supposed to be reported by each MDT.
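
One possible way to coordinate several feeds, sketched here under strong assumptions, is a timestamp-ordered merge of the per-MDT streams. Nothing in this thread says how DNE will distribute records, so the `Record` fields, the fid strings and the assumptions that each stream is already sorted by time and that clocks are comparable across MDTs are all hypothetical.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    """Hypothetical changelog record, reduced to what the merge needs."""
    timestamp: float
    mdt: int
    fid: str
    event: str


def merge_streams(*streams: Iterable[Record]) -> Iterator[Record]:
    """Lazily merge per-MDT streams into one feed ordered by timestamp.

    Assumes every input stream is already sorted by timestamp; records with
    equal timestamps keep the order of the streams they came from.
    """
    return heapq.merge(*streams, key=lambda r: r.timestamp)
```

Whether such a merge is sufficient depends on the answers to the questions above: if events for one fid (or for a fid and its parent) can appear in different streams, the consumer has to tolerate them arriving out of order, for example a record for an entry whose create has not been seen yet.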

Thanks,
Thomas


Eric Barton

2011-Oct-11 13:04 UTC

Thomas,

Interesting point about changelog entries requiring a 'stat'.

Nathan, what's your take on making changelogs tell you what has
changed - even if only on "easy" changes?

          Cheers,
                   Eric

Nathan Rutman

2011-Oct-11 17:12 UTC

We actually already did some of that for a one-off. We didn't push the
changes upstream because there were some ugly layering violations
involved.  Vitaly, do you remember the details?



Vitaly Fertman

2011-Oct-11 18:42 UTC

Hi,

Yes, there is a fixed patch which adds some info to the changelog:
UID, GID, NID. It is not a problem to pack it with other (changed)
inode attributes. It was not sent upstream because an issue has been
found whose fix has not landed yet.

--
Vitaly
