Oracle BZ-4424 (continued in WC LU-80) adds support for larger OST stripe counts via increased EXT4 EA sizes. Some problems with this are:
1) increased MDT storage and network load for transmitting the object list
2) a relatively low new limit (1350, up from 160)

We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then use the same (or a calculable) object identifier (or FID) on all these OSTs.

Our version of wide striping does not involve increasing the EA size at all, but instead uses a new stripe pattern. (This will not be understandable by older Lustre versions, which will generate an error locally; alternatively, we could convert into the BZ-4424 form if the layout fits in that format.) A bitmap will identify which OSTs hold a stripe of this file. The bitmap should fit into the current ext4 EA size limit, giving us ~32k stripes.

Some OSTs may be down at file creation time, or new OSTs may be added later; hence there will likely be holes in the bitmap (but relatively few). A start index will still be used, but stripe order will be strictly round-robin (we will wrap around). In other words, the stripe sequence will always be in linear OST order, starting from start_index, possibly skipping some holes, wrapping around to start_index-1.

Wide-stripe objects do not need a special sequence number (fid_seq); the MDT knows the file was created as wide-striped and marks it as such (LOV_PATTERN_BITMAP). There are two options for OST object identification: common object ID and FID-on-OST.

Common Object ID
The MDT tracks a special range of OST object IDs ("wide stripe objectid" = WSO) that are used on all OSTs. The MDT assigns the next available WSO to the file, and this objectid is used on all the OSTs. The OSTs must never use these objects for regular striped files. A special precreation group for these objects is probably necessary, as well as orphan cleanup (the MDT should purge "hole" objects that aren't allocated from a particular OST). The MDT should track the last assigned WSO; this will be the starting point for new wide-striped files after recovery. Objects cannot be migrated from one OST to another, since this would result in out-of-order access. Similarly, stripes can never be added to holes.

FID-on-OST
Use a mapping of the MDT FID to uniquely determine an OST object. The clients and MDT add the OST number into the MDT FID (probably just reserve one sequence per OST). (This allows the objects to potentially migrate to different OSTs.) The OSTs then internally must map the FID to a local object id. Note this allows OST-local precreation pools, getting the MDT out of the precreate/orphan-cleanup business and potentially improving create speeds, and also facilitates "create on write" semantics. The FID can be assigned during the first access to the OST object.

The big problem here is that FID->OBJID (or better, FID->inode id) translation is absent from the OSTs today. See http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the current state of this?) There is also some work in this direction in the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled for Lustre 2.4).
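For concreteness, a minimal sketch of what such a bitmap layout EA might look like (this is not existing Lustre code; the struct name, field names, and sizes are illustrative assumptions). A 4 KB bitmap gives 32768 bit positions, which is where the ~32k stripe figure comes from:

    #include <stdint.h>

    #define LOV_PATTERN_BITMAP      0x400   /* hypothetical new pattern flag */
    #define WS_BITMAP_BYTES         4096    /* 4 KB -> 32768 possible OST indices */

    /* Hypothetical on-disk layout EA for a bitmap-striped file. */
    struct lov_mds_md_bitmap {
            uint32_t lmm_magic;                     /* identifies this layout format */
            uint32_t lmm_pattern;                   /* LOV_PATTERN_BITMAP */
            uint64_t lmm_object_id;                 /* common WSO used on every OST */
            uint32_t lmm_stripe_size;               /* bytes per stripe */
            uint32_t lmm_start_index;               /* OST index holding stripe 0 */
            uint8_t  lmm_bitmap[WS_BITMAP_BYTES];   /* bit i set => OST i holds a stripe */
    };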
There are a few questions here, probably the first of which is "is it worthwhile to spend effort on this, or is BZ4424 good enough?" Then there is the question of object identification, where FID-on-OST is more flexible, but also significantly more work (and risk). Also, I thought I understood from the EOFS Summit that WC also has a separate FID-on-OST project (separate from Orion, that is) -- can someone tell me the state of that?
On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
> Some OSTs may be down at file creation time, or new OSTs added later;
> hence there will likely be holes in the bitmap (but relatively few).
> Start index will still be used, but stripe order will be strictly
> round-robin (we will wrap around). In other words, the stripe
> sequence will always be in linear OST order, starting from
> start_index, maybe skipping some holes, wrapping around to
> start_index-1.

It didn't occur to me when we spoke at EOFS, but you'd need to store the number of OSTs in the system when the mapping was created if you allow it to wrap around -- otherwise, adding OSTs later would cause existing files to lose track of the objects after the wrap point.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
On Oct 3, 2011, at 5:17 PM, David Dillow wrote:
> On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
> [...]
>
> It didn't occur to me when we spoke at EOFS, but you'd need to store the
> number of OSTs in the system when the mapping was created if you allow
> it to wrap around -- otherwise, adding OSTs later would cause existing
> files to lose track of the objects after the wrap point.

That's done inherently in the bitmap, where everything beyond the current number of OSTs is marked as a hole. (So actually, there will typically be one giant hole at the end of every bitmap, and then maybe some singletons for deactivated OSTs.)
On Tue, 2011-10-04 at 10:44 -0700, Nathan Rutman wrote:
> On Oct 3, 2011, at 5:17 PM, David Dillow wrote:
> > [...]
> > It didn't occur to me when we spoke at EOFS, but you'd need to store the
> > number of OSTs in the system when the mapping was created if you allow
> > it to wrap around -- otherwise, adding OSTs later would cause existing
> > files to lose track of the objects after the wrap point.
>
> That's done inherently in the bitmap, where everything beyond the
> current number of OSTs is marked as a hole. (So actually, there will
> typically be one giant hole at the end of every bitmap, and then maybe
> some singletons for deactivated OSTs.)

Perhaps I'm misunderstanding something, then.

I understood you to say that we would have a linear OST order that starts from the start_index. So bitmap position 0 would be start_index, position 1 would be start_index + 1, and so on. If those bits are on, then there is an object for this file on those OSTs.

Am I on the same page so far?

Now, above you mention wrapping around to start_index - 1; I take this to mean that at some point, we'd say bitmap position N is no longer OST start_index + N, but would be OST 0. Bitmap position N + 1 would be OST 1, etc. This scheme may allow for a more compact bitmap when our file consists of OSTs at the extreme ends of the ones available, but you have to store the maximum OST number when creating the file to avoid having the bitmap wrap point shift when you add new OSTs.

Or perhaps I just misunderstood what you meant by wrapping? Did you mean bitmap position 0 is always OST 0, and the OST indicated by start_index will hold the first object, and each set bit in turn indicates the next OST/object, and if we run out of bits in the bitmap before we hit stripe_count, we'll start checking again at bitmap position/OST 0?
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
Hello, Nathan

On 10/03/2011 01:15 PM, Nathan Rutman wrote:
> [...]
> Our version of wide striping does not involve increasing the EA size at
> all, but instead uses a new stripe pattern. A bitmap will identify which
> OSTs hold a stripe of this file. The bitmap should fit into the current
> ext4 EA size limit, giving us ~32k stripes.
> [...]
> Wide-stripe objects do not need a special sequence number (fid_seq);
> the MDT knows the file was created as wide-striped and marks it as such
> (LOV_PATTERN_BITMAP). There are two options for OST object
> identification: common object ID and FID-on-OST.

Actually, we also discussed using a real object (IAM or another index format) to store the stripe pattern, instead of using an EA. Of course it would use more space, but it would give us the potential to explore the stripe pattern.

> FID-on-OST
> Use a mapping of the MDT FID to uniquely determine an OST object. The
> clients and MDT add the OST number into the MDT FID (probably just
> reserve one sequence per OST). (This allows the objects to potentially
> migrate to different OSTs.) The OSTs then internally must map the FID
> to a local object id. Note this allows OST-local precreation pools,
> getting the MDT out of the precreate/orphan-cleanup business and
> potentially improving create speeds, and also facilitates "create on
> write" semantics. The FID can be assigned during the first access to
> the OST object.

I am not sure I follow your idea here. You mean the OST needs to internally map the MDT FID (with the OST number added in) to an object id (or inode ino)? So there is no real OST FID?
But you also said "The FID can be assigned during the first access to the OST object." Could you please explain more here?

> The big problem here is that FID->OBJID (or better, FID->inode id)
> translation is absent from the OSTs today. See
> http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the
> current state of this?) There is also some work in this direction in
> the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled
> for Lustre 2.4).
>
> There are a few questions here, probably the first of which is "is it
> worthwhile to spend effort on this, or is BZ4424 good enough?" Then
> there is the question of object identification, where FID-on-OST is
> more flexible, but also significantly more work (and risk). Also, I
> thought I understood from the EOFS Summit that WC also has a separate
> FID-on-OST project (separate from Orion, that is) -- can someone tell
> me the state of that?

FID-on-OST is actually part of DNE (distributed namespace) phase I. It basically follows the current fid client/server infrastructure.

1. The MDT is the fid client, which requests fids from the OST and allocates fids for objects during pre-creation.
2. The OST is the fid server, which allocates FIDs to MDTs and requests super fid sequences from the fid control server (root MDT).
3. Similar to the MDT FID, there will be an OI to map FIDs to objects inside the OST.

The code will be released with DNE sometime next year.

Thanks
WangDi
Hi All,

> FID-on-OST is actually part of DNE (distributed namespace) phase I. It
> basically follows the current fid client/server infrastructure.
>
> 1. The MDT is the fid client, which requests fids from the OST and
> allocates fids for objects during pre-creation.
> 2. The OST is the fid server, which allocates FIDs to MDTs and requests
> super fid sequences from the fid control server (root MDT).
> 3. Similar to the MDT FID, there will be an OI to map FIDs to objects
> inside the OST.
>
> The code will be released with DNE sometime next year.

I think we don't need special FIDs for OST objects, unless we want to migrate an object between different data containers across the cluster. I don't think that's a priority for now. So we can simplify FID management for the OST.

Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}. In that case the OST doesn't need to allocate any FIDs, and the MDT can reuse the current allocation scheme. In fact we don't need to assign a FID to the OST object at file creation time (i.e. when creating the LSM), but we do need a guaranteed free OST object to exist when a client tries to access that object. For that, the OST can preallocate a pool and report its size to the MDT; the MDT knows it uses some objects from that pool, but not which object id is assigned to which file. To avoid OST confusion, the client sends the MDT FID to the OST when it needs access to an OST object. The OST looks in the OI database and checks whether that FID is already assigned to something. If assigned, the IO will return an inode; otherwise the OST grabs any free object from the pool and assigns it to that FID. That's all.

Orphan cleanup doesn't need to change in that case -- the MDT sends the last allocated objid, and the OST will kill the unallocated objects and return the last index to the MDT. The open-unlink case needs to change to put a fid in the LLOG record, and the OST needs to change to handle a FID as an object index.
--------------------------------------------
Alexey Lyashkov
alexey_lyashkov at xyratex.com
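To make the lookup-or-bind step described above concrete, here is a small self-contained toy model (plain C, not Lustre code; all names, types, and sizes are illustrative assumptions): the OST holds a pool of precreated anonymous object ids and binds one to an MDT FID only on first access, recording the binding in an OI-like table.

    #include <stdint.h>
    #include <stdio.h>

    #define POOL_SIZE 8

    struct fid { uint64_t seq; uint32_t oid; };

    struct oi_entry { struct fid fid; uint64_t objid; int used; };

    static struct oi_entry oi[POOL_SIZE];   /* toy FID -> objid index (the "OI") */
    static uint64_t pool[POOL_SIZE] = { 101, 102, 103, 104, 105, 106, 107, 108 };
    static int pool_next;                   /* next unassigned precreated object */

    static int fid_eq(const struct fid *a, const struct fid *b)
    {
            return a->seq == b->seq && a->oid == b->oid;
    }

    /* Return the local object id bound to @f, binding a fresh one from the
     * precreated pool on first access; -1 means the pool is exhausted and
     * the MDT would have to precreate more. */
    static int64_t ost_objid_by_fid(const struct fid *f)
    {
            for (int i = 0; i < POOL_SIZE; i++)
                    if (oi[i].used && fid_eq(&oi[i].fid, f))
                            return (int64_t)oi[i].objid;    /* already assigned */

            if (pool_next >= POOL_SIZE)
                    return -1;

            oi[pool_next] = (struct oi_entry){ *f, pool[pool_next], 1 };
            return (int64_t)oi[pool_next++].objid;
    }

    int main(void)
    {
            struct fid f = { .seq = 0x200000001ULL, .oid = 42 };

            /* First access assigns an object from the pool; later accesses
             * find the same binding in the OI. */
            printf("first access:  objid %lld\n", (long long)ost_objid_by_fid(&f));
            printf("second access: objid %lld\n", (long long)ost_objid_by_fid(&f));
            return 0;
    }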
On Oct 4, 2011, at 2:16 PM, David Dillow wrote:
> [...]
> Perhaps I'm misunderstanding something, then.
>
> I understood you to say that we would have a linear OST order that
> starts from the start_index. So bitmap position 0 would be start_index,
> position 1 would be start_index + 1, and so on. If those bits are on,
> then there is an object for this file on those OSTs.

Sorry if I'm being unclear.

start_index is just an offset into the bitmap. That's the OST where the first stripe will be. The next stripe will be on the next OST index (unless it is a hole). When we get to the big hole at the end of the used OSTs, those OST index locations are all skipped (since they are holes), and the next stripe will be at OST index 0, then 1, etc., up to start_index-1 (again, unless holes).

> Am I on the same page so far?
>
> Now, above you mention wrapping around to start_index - 1; I take this
> to mean that at some point, we'd say bitmap position N is no longer OST
> start_index + N, but would be OST 0. Bitmap position N + 1 would be OST
> 1, etc. This scheme may allow for a more compact bitmap when our file
> consists of OSTs at the extreme ends of the ones available, but you have
> to store the maximum OST number when creating the file to avoid having
> the bitmap wrap point shift when you add new OSTs.
>
> Or perhaps I just misunderstood what you meant by wrapping? Did you mean
> bitmap position 0 is always OST 0, and the OST indicated by start_index
> will hold the first object, and each set bit in turn indicates the next
> OST/object, and if we run out of bits in the bitmap before we hit
> stripe_count, we'll start checking again at bitmap position/OST 0?
On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
> Sorry if I'm being unclear.
>
> start_index is just an offset into the bitmap. That's the OST where the
> first stripe will be. The next stripe will be on the next OST index
> (unless it is a hole). When we get to the big hole at the end of the used
> OSTs, those OST index locations are all skipped (since they are holes),
> and the next stripe will be at OST index 0, then 1, etc., up to
> start_index-1 (again, unless holes).

Ok, so bitmap position 0 is always OST 0; thanks for clearing up my misunderstanding.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
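To pin down the semantics settled in this exchange, here is a minimal illustrative helper (not Lustre code; names are assumptions) that maps a logical stripe number to an OST index under the proposed bitmap pattern: bit i of the bitmap corresponds to OST index i, stripe 0 lives on start_index, and later stripes follow increasing OST index order, skipping holes and wrapping to index 0.

    #include <stdint.h>

    static int test_bit(const uint8_t *bm, unsigned int i)
    {
            return (bm[i >> 3] >> (i & 7)) & 1;
    }

    /* Return the OST index holding stripe @stripe_no, or -1 if the bitmap
     * has fewer set bits than stripe_no + 1. */
    int wide_stripe_to_ost(const uint8_t *bitmap, unsigned int nbits,
                           unsigned int start_index, unsigned int stripe_no)
    {
            unsigned int seen = 0;

            for (unsigned int n = 0; n < nbits; n++) {
                    unsigned int ost = (start_index + n) % nbits;

                    if (!test_bit(bitmap, ost))
                            continue;               /* hole: no object on this OST */
                    if (seen++ == stripe_no)
                            return (int)ost;
            }
            return -1;
    }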
On Oct 4, 2011, at 5:25 PM, wangdi wrote:
> Hello, Nathan
>
> On 10/03/2011 01:15 PM, Nathan Rutman wrote:
>> [...]
>> A bitmap will identify which OSTs hold a stripe of this file. The
>> bitmap should fit into the current ext4 EA size limit, giving us
>> ~32k stripes.
>> [...]
>
> Actually, we also discussed using a real object (IAM or another index
> format) to store the stripe pattern, instead of using an EA. Of course
> it would use more space, but it would give us the potential to explore
> the stripe pattern.

One of the main (the only?) benefits of our design (over the current BZ4424 widestriping) is that it does not need any more space than the old MDT stripe pattern. No additional storage, and no additional network traffic to transmit the pattern.

>> FID-on-OST
>> Use a mapping of the MDT FID to uniquely determine an OST object. The
>> clients and MDT add the OST number into the MDT FID (probably just
>> reserve one sequence per OST). (This allows the objects to potentially
>> migrate to different OSTs.) The OSTs then internally must map the FID
>> to a local object id.
>> Note this allows OST-local precreation pools, getting the MDT out of
>> the precreate/orphan-cleanup business and potentially improving create
>> speeds, and also facilitates "create on write" semantics. The FID can
>> be assigned during the first access to the OST object.
>
> I am not sure I follow your idea here. You mean the OST needs to
> internally map the MDT FID (with the OST number added in) to an object
> id (or inode ino)?

Yes.

> So there is no real OST FID?

I suppose -- this is just a mapping of the MDT fid to the local OST object id, via a local lookup on the OST. There would be something like the OI to do this mapping.

> But you also said "The FID can be assigned during the first access to
> the OST object." Could you please explain more here?

Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write. This is not integral to the design, just a side effect.

> FID-on-OST is actually part of DNE (distributed namespace) phase I. It
> basically follows the current fid client/server infrastructure.
>
> 1. The MDT is the fid client, which requests fids from the OST and
> allocates fids for objects during pre-creation.
> 2. The OST is the fid server, which allocates FIDs to MDTs and requests
> super fid sequences from the fid control server (root MDT).
> 3. Similar to the MDT FID, there will be an OI to map FIDs to objects
> inside the OST.

To integrate with this, we would need to have a reserved sequence on each OST that the MDT could assign FIDs from -- the MDT would need to use the same Object ID on all OSTs. For DNE, there would need to be a reserved sequence per OST per MDT.
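As a sketch of the "reserved sequence per OST" idea above (the sequence base, helper, and values here are hypothetical, not existing Lustre code; only the three-field FID shape mirrors Lustre's lu_fid), the OST index could be folded into the FID sequence while the object id part stays the same on every OST:

    #include <stdint.h>

    /* Mirrors the shape of Lustre's lu_fid (sequence, object id, version). */
    struct lu_fid {
            uint64_t f_seq;
            uint32_t f_oid;
            uint32_t f_ver;
    };

    /* Hypothetical base of a range of sequences reserved for wide-striped
     * objects, one sequence per OST index. */
    #define WIDE_STRIPE_SEQ_BASE    0x200000000ULL

    /* Build the FID a client or MDT would use for the stripe on @ost_idx:
     * same object id everywhere, OST index encoded in the sequence. */
    static inline struct lu_fid wide_ost_fid(uint32_t ost_idx,
                                             const struct lu_fid *mdt_fid)
    {
            struct lu_fid fid = {
                    .f_seq = WIDE_STRIPE_SEQ_BASE + ost_idx,
                    .f_oid = mdt_fid->f_oid,        /* common object id */
                    .f_ver = 0,
            };
            return fid;
    }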
Shadow,

Your comment describes create-on-write (CROW), which is vulnerable to orphan creation by clients which have been evicted from the MDS but are not actually dead, unless further safeguards are implemented such as capabilities or server-cluster-wide client eviction.

I also think that the decision to use FIDs in the way you suggest has architectural implications which would benefit from further discussion. The original idea was that a FID would be all you need to identify any object (including its target) and that using them uniformly in this way could help simplify the code and enable further development - e.g. to allow unified targets which mix namespace and data objects to better support small/sparse files.

Making the FID just a unique identifier which requires a target index to specify a specific object doesn't have to be inconsistent with uniform usage for data and metadata, but it has further knock-on implications which must be acknowledged and debated explicitly before we go further. We really must be confident we've thought through all the consequences of our architectural decisions before we invest development effort in them. It's just too expensive to reverse a bad decision otherwise.

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Alexey Lyashkov
> Sent: 05 October 2011 10:29 AM
> To: wangdi
> Cc: Alexander Boyko; Lustre Development Mailing List; Artem Blagodarenko; Nathan Rutman
> Subject: Re: [Lustre-devel] Wide striping
>
> [...]
On Oct 5, 2011, at 2:28 AM, Alexey Lyashkov wrote:
> [...]
> Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}.
> In that case the OST doesn't need to allocate any FIDs, and the MDT can reuse
> the current allocation scheme. In fact we don't need to assign a FID to the OST
> object at file creation time (i.e. when creating the LSM), but we do need a
> guaranteed free OST object to exist when a client tries to access that object.
> For that, the OST can preallocate a pool and report its size to the MDT; the
> MDT knows it uses some objects from that pool, but not which object id is
> assigned to which file. To avoid OST confusion, the client sends the MDT FID
> to the OST when it needs access to an OST object. The OST looks in the OI
> database and checks whether that FID is already assigned to something. If
> assigned, the IO will return an inode; otherwise the OST grabs any free object
> from the pool and assigns it to that FID.
> [...]

What Shadow is saying here (correct me if I'm wrong) is that full-blown FIDs on OSTs are really needed; just a way to map the MDT fid to the local object id. (The other general class of solution being to reserve a specific range of common ost object ids, and do no mapping.) Both of these are significantly less complicated than the DNE FID-on-OST description.

As I was hinting at before, perhaps there's not a very strong case to be made for doing anything other than using the "just make it bigger" solution of BZ4424. I was trying to gauge the interest of the community in an intermediate solution.
On Oct 5, 2011, at 11:18 AM, Nathan Rutman wrote:
> [...]
> What Shadow is saying here (correct me if I'm wrong) is that full-blown
> FIDs on OSTs are really needed;

s/are/aren't/ :(

> just a way to map the MDT fid to the local object id. (The other general
> class of solution being to reserve a specific range of common ost object
> ids, and do no mapping.) Both of these are significantly less complicated
> than the DNE FID-on-OST description.
>
> As I was hinting at before, perhaps there's not a very strong case to be
> made for doing anything other than using the "just make it bigger"
> solution of BZ4424. I was trying to gauge the interest of the community
> in an intermediate solution.
On Oct 5, 2011, at 11:02 AM, Eric Barton wrote:
> Shadow,
>
> Your comment describes create-on-write (CROW), which is vulnerable
> to orphan creation by clients which have been evicted from the MDS
> but are not actually dead, unless further safeguards are implemented
> such as capabilities or server-cluster-wide client eviction.

create-on-write isn't really an integral part of this design, just a side thought. Let's leave it out of the discussion for now.

> I also think that the decision to use FIDs in the way you suggest
> has architectural implications which would benefit from further
> discussion. The original idea was that a FID would be all you need
> to identify any object (including its target) and that using them
> uniformly in this way could help simplify the code and enable further
> development - e.g. to allow unified targets which mix namespace and
> data objects to better support small/sparse files.
>
> Making the FID just a unique identifier which requires a target index
> to specify a specific object doesn't have to be inconsistent with
> uniform usage for data and metadata, but it has further knock-on
> implications which must be acknowledged and debated explicitly
> before we go further. We really must be confident we've thought
> through all the consequences of our architectural decisions before
> we invest development effort in them. It's just too expensive to
> reverse a bad decision otherwise.

That's what we're trying to do now :)

The issue as I see it is that we're thinking about a feature that could be useful today, and is implementable today, except for the fact that there are some longer-term plans that might conflict. Our wide striping could be implemented on top of WC's future FID-on-OST plans -- but that would require that code to exist. So then we have to decide whether waiting is the best option, or whether a more minimal change (probably the "common object ID" from my original arch email) could land first, and then DNE FID-on-OST could change it later.

> Cheers,
> Eric
On 10/05/2011 09:06 AM, Nathan Rutman wrote:
>> [...]
>> But you also said "The FID can be assigned during the first access to
>> the OST object." Could you please explain more here?
>
> Since the FID -> Objid mapping is performed locally, it doesn't need
> to be assigned until the first write. This is not integral to the
> design, just a side effect.

Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may add some limits in the future (what if we decide to store some small-file data on the MDT directly?). It also adds another difference between the MDT and OST; probably it conflicts with the efforts to unify the MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I miss the point of your suggestion.

Thanks
Wangdi
>> FID-on-OST is actually part of DNE (distributed namespace) phase I. It basically follows the current FID client/server infrastructure.
>>
>> 1. The MDT is the fid client, which requests fids from the OST and allocates fids for the objects during pre-creation.
>> 2. The OST is the fid server, which will allocate FIDs to MDTs and request super fid sequences from the fid control server (root MDT).
>> 3. Similar to MDT FIDs, there will be an OI to map a FID to an object inside the OST.
>
> To integrate with this, we would need to have a reserved sequence on each OST that the MDT could assign FIDs from -- the MDT would need to use the same Object ID on all OSTs. For DNE, there would need to be a reserved sequence per OST per MDT.
>
>> The code will be released with DNE sometime next year.
>>
>> Thanks
>> WangDi
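To make the reserved-sequence idea above concrete, here is a minimal sketch of composing a widestripe object FID from a per-OST reserved sequence plus a common object ID, and of an OST-side OI-style lookup that resolves it to a local object. All of the names (struct ws_fid, WS_SEQ_BASE, oi_lookup, the sequence layout) are hypothetical illustrations, not existing Lustre symbols; the real FID format and OI interfaces differ.

/*
 * Hypothetical sketch: a widestripe object FID built from a per-OST
 * reserved sequence plus a common wide-stripe object ID (WSO), and an
 * OI-style table on the OST mapping that FID to a local object number.
 * None of these names are real Lustre symbols.
 */
#include <stdint.h>
#include <stdio.h>

struct ws_fid {
	uint64_t f_seq;	/* reserved sequence, one per OST (per MDT for DNE) */
	uint32_t f_oid;	/* common wide-stripe object ID, same on every OST */
	uint32_t f_ver;	/* unused here */
};

/* assumed base of the reserved sequence range for widestripe objects */
#define WS_SEQ_BASE	0x200000000ULL

/* MDT/client side: derive the object FID for the stripe on a given OST */
static struct ws_fid ws_object_fid(uint32_t ost_idx, uint32_t wso)
{
	struct ws_fid fid = {
		.f_seq = WS_SEQ_BASE + ost_idx,
		.f_oid = wso,
		.f_ver = 0,
	};
	return fid;
}

/* OST side: toy OI table mapping FID -> local inode/object number */
struct oi_entry {
	struct ws_fid fid;
	uint64_t local_objid;	/* could be assigned lazily, on first write */
};

static uint64_t oi_lookup(const struct oi_entry *oi, int n,
			  const struct ws_fid *fid)
{
	for (int i = 0; i < n; i++)
		if (oi[i].fid.f_seq == fid->f_seq &&
		    oi[i].fid.f_oid == fid->f_oid)
			return oi[i].local_objid;
	return 0;	/* not yet mapped: could trigger create-on-write */
}

int main(void)
{
	struct ws_fid fid = ws_object_fid(/* ost_idx */ 7, /* wso */ 42);
	struct oi_entry oi[] = { { fid, 123456 } };

	printf("seq=%#llx oid=%u -> local objid %llu\n",
	       (unsigned long long)fid.f_seq, fid.f_oid,
	       (unsigned long long)oi_lookup(oi, 1, &fid));
	return 0;
}

In this reading, the lazy-assignment point discussed above would simply mean the OI entry for a FID is created on the first write rather than at precreate time.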
Hello!

On Oct 5, 2011, at 3:44 PM, wangdi wrote:
>> Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write. This is not integral to the design, just a side effect.
>
> Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may add some limits in the future (what if we decide to store some small file data on the MDT directly?), and it also adds another difference between MDT and OST -- it probably conflicts with the effort to unify MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I have missed the point of your suggestion.

Another problem I see here is similar to create-on-write. Say we delete a file: do we purge this mapping table too, and then recreate an orphan object when a stale client comes? Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups? Or do we purge only really old objects from it -- what triggers that, and what failure scenarios are there for this process? How do we recover from disasters that happen to this table?

Bye,
Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.
On Wed, 2011-10-05 at 19:31 -0400, Oleg Drokin wrote:
> Hello!
>
> On Oct 5, 2011, at 3:44 PM, wangdi wrote:
>>> Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write. This is not integral to the design, just a side effect.
>>
>> Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may add some limits in the future (what if we decide to store some small file data on the MDT directly?), and it also adds another difference between MDT and OST -- it probably conflicts with the effort to unify MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I have missed the point of your suggestion.
>
> Another problem I see here is similar to create-on-write. Say we delete a file: do we purge this mapping table too, and then recreate an orphan object when a stale client comes? Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups? Or do we purge only really old objects from it -- what triggers that, and what failure scenarios are there for this process? How do we recover from disasters that happen to this table?

Wouldn't the online lfsck work being done for OpenSFS catch and correct these types of problems? It seems like it could provide a base for purging/compacting the table as well, but that's obviously going to be a complicated endeavor....

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
On 2011-10-05, at 9:33 AM, David Dillow <dillowda at ornl.gov> wrote:
> On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
>> Sorry if I'm being unclear.
>>
>> start_index is just an offset into the bitmap. That's the OST where the first stripe will be. The next stripe will be on the next OST index (unless it is a hole). When we get to the big hole at the end of the used OSTs, those OST index locations are all skipped (since they are holes), and the next stripe will be at OST index 0, then 1, etc., up to start_index-1 (again, unless there are holes).
>
> Ok, so bitmap position 0 is always OST 0; thanks for clearing up my misunderstanding.

But this means that the table always needs to be as large as the maximum OST number. If the bitmap started at the starting OST index, it would only need to be as large as the number of stripes.

That said, the limitation of not being able to migrate objects with this layout is a major one. The ability to do online object migration is just arriving with the layout lock (from HSM), so I expect it to be useful to many users.

Cheers, Andreas
On Oct 5, 2011, at 6:51 PM, Andreas Dilger wrote:
> On 2011-10-05, at 9:33 AM, David Dillow <dillowda at ornl.gov> wrote:
>> On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
>>> Sorry if I'm being unclear.
>>>
>>> start_index is just an offset into the bitmap. That's the OST where the first stripe will be. The next stripe will be on the next OST index (unless it is a hole). When we get to the big hole at the end of the used OSTs, those OST index locations are all skipped (since they are holes), and the next stripe will be at OST index 0, then 1, etc., up to start_index-1 (again, unless there are holes).
>>
>> Ok, so bitmap position 0 is always OST 0; thanks for clearing up my misunderstanding.
>
> But this means that the table always needs to be as large as the maximum OST number. If the bitmap started at the starting OST index, it would only need to be as large as the number of stripes.

Yes, the table is as large as the maximum possible OST number. 32,000 stripes fit in a bitmap in the current (non-extended) EA size. If you started at the starting OST index, you would need to record the last OST number as well. Either way, I don't see it as a problem.

> That said, the limitation of not being able to migrate objects with this layout is a major one. The ability to do online object migration is just arriving with the layout lock (from HSM), so I expect it to be useful to many users.

Well, that's why we added the complication of embedding the OST index into the object FIDs that the clients would ask for. Then you could migrate that object to a new OST -- but really only for exceptional cases. General migration, e.g. for space rebalancing, would result in a bunch of additional overhead to figure out where all the stripes moved to. So I agree -- this is a weakness of the bitmap design, which really implies a fixed ordering.
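For reference, the round-robin addressing being discussed here (bitmap position 0 is always OST 0; stripes are assigned in linear OST order starting at start_index, skipping holes and wrapping around) can be illustrated with a short sketch. This is not Lustre code -- the bitmap representation and function names are invented for the example -- but it also shows why a map sized to the maximum OST number, 32,768 bits (4 KiB), lines up with the ~32k-stripe figure quoted for the current EA limit.

/*
 * Illustration only: map a logical stripe number to an OST index for
 * the proposed LOV_PATTERN_BITMAP layout.  Bit i of the map means
 * "OST i holds a stripe of this file".  Stripes are assigned in
 * linear OST order starting at start_idx, skipping holes (clear bits)
 * and wrapping around to start_idx - 1.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_OST	32768	/* 32768 bits == 4 KiB bitmap */

static int ost_in_map(const uint8_t *map, uint32_t idx)
{
	return (map[idx / 8] >> (idx % 8)) & 1;
}

/* Return the OST index holding stripe number 'stripe', or -1 if the
 * bitmap has fewer set bits than stripe + 1. */
static int stripe_to_ost(const uint8_t *map, uint32_t start_idx,
			 uint32_t stripe)
{
	uint32_t seen = 0;

	for (uint32_t n = 0; n < MAX_OST; n++) {
		uint32_t idx = (start_idx + n) % MAX_OST;

		if (!ost_in_map(map, idx))
			continue;	/* hole: skip it */
		if (seen++ == stripe)
			return idx;
	}
	return -1;
}

int main(void)
{
	uint8_t map[MAX_OST / 8] = { 0 };

	/* OSTs 0..9 hold stripes, except OST 4 which is a hole */
	for (uint32_t i = 0; i < 10; i++)
		if (i != 4)
			map[i / 8] |= 1 << (i % 8);

	/* start at OST 6: stripes land on OSTs 6,7,8,9,0,1,2,3,5 */
	for (uint32_t s = 0; s < 9; s++)
		printf("stripe %u -> OST %d\n", s, stripe_to_ost(map, 6, s));
	return 0;
}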
Hello!

On Oct 5, 2011, at 7:56 PM, David Dillow wrote:
>> Another problem I see here is similar to create-on-write. Say we delete a file: do we purge this mapping table too, and then recreate an orphan object when a stale client comes? Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups? Or do we purge only really old objects from it -- what triggers that, and what failure scenarios are there for this process? How do we recover from disasters that happen to this table?
>
> Wouldn't the online lfsck work being done for OpenSFS catch and correct these types of problems?

It probably would, once the online lfsck is implemented.

Bye,
Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.
On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
> ... snip...
> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>
> Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern. (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit into the current ext4 EA size limit, giving us ~32k stripes.
>
> Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few).

1) There will be holes when OST pools are used: if the file can be written only to the set of OSTs from a specific OST pool, and if by virtue of the configuration the OSTs in the pool do not form a contiguous set, then there will be holes in the OST bitmap even if all OSTs are online.

2) "Relatively few holes [in bitmap]" -- did you consider compressing the bitmap? Like BBC or WAH, described at en.wikipedia.org/wiki/Bitmap_index#Compression? Reportedly you can do bitwise operations without decompression. This way you can go up in the number of stripes (well, 32k is a big number). But it may also help control RPC size -- you can represent wide striping with a few integers effectively describing contiguous blocks of OSTs and the holes between them, so the size of the descriptor is a function of the number of blocks and holes and, to a lesser extent, of the number of stripes.

More: it is possible to have two bitmaps:

0000000111111111000000111111 - one describing general "blocks" of OSTs = ((beg1,end1),(beg2,end2))
0000000000000010100100100000 - the other describing "corrections" -- drop two OSTs, add two OSTs; here 4 bits, compressed to X bytes
0000000111111101100100011111 - the OST map, computed on the client as the bitwise XOR of the uncompressed maps (1) and (2)

Each of the two maps is compressed for transfer, so together they should not take much space.

3) If the metadata file format is going to be changed, is this the right time to reserve descriptors for a few replicas of the file data?

In that case we need to store the number of replicas, and a layout descriptor for each replica. Each replica may have a different number of stripes, so you could have a widely striped file replica on SAS disks (or in flash) and replicate it to slower disk storage with one or "a few" stripes for further tape archival. I assume that after the initial writes a file has more or less "stable" content. Replicas can be on different media types, like flash/SAS/SATA or fast/cheap disks -- effectively hierarchical storage. I'm thinking about "lazy" replication like you implemented to replicate data to another file system, but in this case the replication is within the same Lustre file system. The client becomes aware of multiple replicas and can choose which file replica to use (e.g. when some OSTs are down). It eliminates the OST as a single point of failure.

Alex.
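The two-bitmap scheme proposed above is, once the maps are uncompressed, just a bitwise XOR; the following small illustration reproduces the 28-bit example from the message. The helper names are invented and the compression step (BBC/WAH or similar) is deliberately left out, since only the combination is being shown.

/*
 * Illustration of the two-bitmap idea: a "blocks" map describing
 * contiguous runs of OSTs, XORed with a "corrections" map that drops
 * some OSTs and adds others, yields the final OST map.  Compression
 * of the two maps is omitted; only the combination is shown.  Bit 0
 * corresponds to the leftmost character of the strings, matching the
 * example in the email.
 */
#include <stdint.h>
#include <stdio.h>

#define NBITS 28

static uint32_t parse_bits(const char *s)
{
	uint32_t v = 0;

	for (int i = 0; i < NBITS; i++)
		if (s[i] == '1')
			v |= 1u << i;
	return v;
}

static void print_bits(const char *label, uint32_t v)
{
	char buf[NBITS + 1];

	for (int i = 0; i < NBITS; i++)
		buf[i] = (v >> i) & 1 ? '1' : '0';
	buf[NBITS] = '\0';
	printf("%-13s %s\n", label, buf);
}

int main(void)
{
	uint32_t blocks = parse_bits("0000000111111111000000111111");
	uint32_t corr   = parse_bits("0000000000000010100100100000");
	uint32_t ostmap = blocks ^ corr;	/* drop two OSTs, add two */

	print_bits("blocks:", blocks);
	print_bits("corrections:", corr);
	print_bits("OST map:", ostmap);
	/* expected OST map: 0000000111111101100100011111 */
	return 0;
}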
On 2011-10-20, at 10:24 AM, Alex Kulyavtsev wrote:
> On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
>> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>>
>> Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern. (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit into the current ext4 EA size limit, giving us ~32k stripes.
>>
>> Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few).
>
> 1) There will be holes when OST pools are used: if the file can be written only to the set of OSTs from a specific OST pool, and if by virtue of the configuration the OSTs in the pool do not form a contiguous set, then there will be holes in the OST bitmap even if all OSTs are online.

Since the membership in a pool can change after a file is allocated, there cannot be anything in the layout that depends on the current membership of the pool. In this regard, the layout of a file that is allocated in the pool should be identical to a non-pool file, with the exception that it saves the pool name in which the file was created. That allows future operations (migration, replication, etc.) to take the originally requested pool of the user into account.

> 2) "Relatively few holes [in bitmap]" -- did you consider compressing the bitmap? Like BBC or WAH, described at en.wikipedia.org/wiki/Bitmap_index#Compression? Reportedly you can do bitwise operations without decompression. This way you can go up in the number of stripes (well, 32k is a big number). But it may also help control RPC size -- you can represent wide striping with a few integers effectively describing contiguous blocks of OSTs and the holes between them, so the size of the descriptor is a function of the number of blocks and holes and, to a lesser extent, of the number of stripes.

I think that having some kind of bitmap compression seems reasonable, and it extends the number of stripes that can fit into a single layout for most cases. Originally I was thinking that in addition to saving the starting index of the bitmap, we could also save the index at which the bitmap wraps back to 0 (i.e. bit N = (start_idx + N) % wrap_idx), but if there is bitmap compression then the run of zeroes between the starting index and the (lower) ending index could be stored efficiently as well.

> More: it is possible to have two bitmaps:
>
> 0000000111111111000000111111 - one describing general "blocks" of OSTs = ((beg1,end1),(beg2,end2))
> 0000000000000010100100100000 - the other describing "corrections" -- drop two OSTs, add two OSTs; here 4 bits, compressed to X bytes
> 0000000111111101100100011111 - the OST map, computed on the client as the bitwise XOR of the uncompressed maps (1) and (2)
>
> Each of the two maps is compressed for transfer, so together they should not take much space.

Originally, I was thinking that we don't need to do boolean operations on the compressed bitmaps, but then I recall an idea I had many, many years ago about clients sending the "desired" (AND "available") OSC bitmap to the MDS.
When the MDS is allocating objects on the OSTs, it can AND the client bitmap with its allocation bitmap ("pool" bitmap AND "available objects" bitmap) to get the subset of OSTs where objects can be allocated. If we can do operations directly on the compressed bitmaps, not only does that save space, it also saves cycles doing the operations.

> 3) If the metadata file format is going to be changed, is this the right time to reserve descriptors for a few replicas of the file data?
>
> In that case we need to store the number of replicas, and a layout descriptor for each replica. Each replica may have a different number of stripes, so you could have a widely striped file replica on SAS disks (or in flash) and replicate it to slower disk storage with one or "a few" stripes for further tape archival.

Right. I've always thought that the different replicas of the file would have completely independent layouts, to allow what you suggest. The striping of a file would be completely different for nearline storage and archival storage (different OST counts at each layer vs. tape drives).

> I assume that after the initial writes a file has more or less "stable" content. Replicas can be on different media types, like flash/SAS/SATA or fast/cheap disks -- effectively hierarchical storage. I'm thinking about "lazy" replication like you implemented to replicate data to another file system, but in this case the replication is within the same Lustre file system. The client becomes aware of multiple replicas and can choose which file replica to use (e.g. when some OSTs are down). It eliminates the OST as a single point of failure.

Yes, my initial goal is to have background file replication as opposed to real-time replication. The main reason is that the complexity of the implementation is lower. In fact, once we have decided on a new layout format for RAID-1+0 files, background replication and internal file migration can largely be implemented with the HSM code.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
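The MDS-side use of bitmaps described above (AND of the client's "desired" map with the pool map and the "available objects" map) can be sketched as follows. The helpers and sizes are invented for illustration, and plain word-array bitmaps are used, whereas the point in the message is that the same AND could be done directly on compressed maps to save both space and cycles.

/*
 * Sketch of the allocation idea: the MDS ANDs the client's "desired"
 * OSC bitmap with its own pool bitmap and its "OSTs with available
 * precreated objects" bitmap to obtain the set of OSTs that objects
 * may be allocated on.  Word-array bitmaps are used here purely for
 * illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_OST		1024
#define BITS_PER_WORD	64
#define NWORDS		(MAX_OST / BITS_PER_WORD)

static void bitmap_and3(uint64_t *out, const uint64_t *a,
			const uint64_t *b, const uint64_t *c)
{
	for (int i = 0; i < NWORDS; i++)
		out[i] = a[i] & b[i] & c[i];
}

static void bitmap_set(uint64_t *map, unsigned idx)
{
	map[idx / BITS_PER_WORD] |= 1ULL << (idx % BITS_PER_WORD);
}

int main(void)
{
	uint64_t desired[NWORDS] = { 0 }, pool[NWORDS] = { 0 };
	uint64_t avail[NWORDS] = { 0 }, cand[NWORDS];

	/* client wants OSTs 0..15, the pool contains the even OSTs,
	 * and OST 6 has no precreated objects at the moment */
	for (unsigned i = 0; i < 16; i++)
		bitmap_set(desired, i);
	for (unsigned i = 0; i < MAX_OST; i += 2)
		bitmap_set(pool, i);
	for (unsigned i = 0; i < MAX_OST; i++)
		if (i != 6)
			bitmap_set(avail, i);

	bitmap_and3(cand, desired, pool, avail);

	printf("allocation candidates:");
	for (unsigned i = 0; i < MAX_OST; i++)
		if (cand[i / BITS_PER_WORD] >> (i % BITS_PER_WORD) & 1)
			printf(" %u", i);
	printf("\n");	/* expected: 0 2 4 8 10 12 14 */
	return 0;
}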
On Oct 20, 2011, at 2:08 PM, Nathan Rutman wrote:
> On Oct 20, 2011, at 11:45 AM, Andreas Dilger wrote:
>> On 2011-10-20, at 10:24 AM, Alex Kulyavtsev wrote:
>>> On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
>>>> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>>>
>>> 1) There will be holes when OST pools are used: if the file can be written only to the set of OSTs from a specific OST pool, and if by virtue of the configuration the OSTs in the pool do not form a contiguous set, then there will be holes in the OST bitmap even if all OSTs are online.
>>
>> Since the membership in a pool can change after a file is allocated, there cannot be anything in the layout that depends on the current membership of the pool. In this regard, the layout of a file that is allocated in the pool should be identical to a non-pool file, with the exception that it saves the pool name in which the file was created. That allows future operations (migration, replication, etc.) to take the originally requested pool of the user into account.
>
> Yes, exactly like current striping works -- the pool name is recorded, but it is only informational: the actual striping is explicitly recorded.

Sorry for not being clear; I agree the file is laid out at creation time. I'm just trying to make the point that pool configuration is another source of holes in the bitmap, in addition to OSTs being down. Suppose a user purchased eight OSTs each year for three years, and allocated four OSTs to pool1 and four to pool2. The OST numbering gets mixed, and OSTs are assigned as follows:

1111 0000 1111 0000 1111 0000 - pool1
0000 1111 0000 1111 0000 1111 - pool2

All OSTs are up, and the file was striped across all OSTs in pool1. Thus the file layout looks like:

1111 0000 1111 0000 1111 0000

The file has holes in its OST layout because of the pool configuration.

>> 2) "Relatively few holes [in bitmap]" -- did you consider compressing the bitmap? Like BBC or WAH, described at en.wikipedia.org/wiki/Bitmap_index#Compression? Reportedly you can do bitwise operations without decompression. This way you can go up in the number of stripes (well, 32k is a big number). But it may also help control RPC size -- you can represent wide striping with a few integers effectively describing contiguous blocks of OSTs and the holes between them, so the size of the descriptor is a function of the number of blocks and holes and, to a lesser extent, of the number of stripes.
>
> I think that having some kind of bitmap compression seems reasonable, and it extends the number of stripes that can fit into a single layout for most cases. Originally I was thinking that in addition to saving the starting index of the bitmap, we could also save the index at which the bitmap wraps back to 0 (i.e. bit N = (start_idx + N) % wrap_idx), but if there is bitmap compression then the run of zeroes between the starting index and the (lower) ending index could be stored efficiently as well.

> I don't think there's any point in compressing this. 32,000 stripes fit in the old EA limit, and there are going to be plenty of other limits hit before we start using 32,000 OSTs. And even then, we can use the larger EA size.
> So perhaps we turn the question around and ask, "how many stripes do you want to support"?

Frankly, we do not use wide striping at this point, and 32k is a "large number." Having said that, if you have a flash OST on each compute node and/or have replication and can use local disk on compute nodes for opportunistic storage ("local file replicas"), then the number of OSTs in the cluster is O(compute nodes), and that can be a "large number" too.

Best regards,
Alex.