Hi,
I'm new here. Can anybody tell me how Lustre manages the data stripes within one OST? Suppose there are multiple disks in one OST.

Thanks,
Jaln
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Christopher J. Morrone
2013-Jun-13 00:09 UTC
Re: how the lustre distribute data among disks within one OST
Lustre does not manage the individual disks. It sits on top of a filesystem, either ldiskfs (basically ext4) or ZFS (as of Lustre 2.4).

You group multiple disks into a single block device or filesystem using any of the normal mechanisms: hardware RAID, Linux mdraid, ZFS pools, etc.
E.S. Rosenberg
2013-Jun-13 12:19 UTC
Re: how the lustre distribute data among disks within one OST
On Thu, Jun 13, 2013 at 3:09 AM, Christopher J. Morrone <morrone2-i2BcT+NCU+M@public.gmane.org> wrote:
> Lustre does not manage the individual disks. It sits on top of a
> filesystem, either ldiskfs (basically ext4) or ZFS (as of Lustre 2.4).

Is ZFS the recommended filesystem, or just an option? Doesn't ZFS suffer major performance drawbacks on Linux because it lives in userspace?

Thanks,
Eli
Christopher J. Morrone
2013-Jun-13 17:22 UTC
Re: how the lustre distribute data among disks within one OST
On 06/13/2013 05:19 AM, E.S. Rosenberg wrote:
> Is ZFS the recommended fs, or just an option?
> Doesn't ZFS suffer major performance drawbacks on Linux due to it
> living in userspace?

LLNL (Brian Behlendorf) ported ZFS natively to Linux. We are not using the FUSE (userspace) version. You can find it at:

http://zfsonlinux.org

ZFS is one of the two backend filesystem options for Lustre, and 2.4 is the first Lustre release that fully supports it. Here at LLNL we are using it on our newest filesystem, which at 55PB is also our largest.

Chris
If I have 6 stripes and 2 OSTs, using round-robin striping, stripes 0, 2, 4 will be on OST0 and stripes 1, 3, 5 will be on OST1. Do you have any idea what the difference would be between accessing stripes 0 and 4 versus stripes 0 and 2? Stripes 0 and 2 seem to be closer together than 0 and 4. Or will Lustre do some intelligent work?

Jaln
--
Genius only means hard-working all one's life
Christopher J. Morrone
2013-Jun-13 21:54 UTC
Re: how the lustre distribute data among disks within one OST
I think you may be confused about what a stripe is in Lustre. If there are only 2 OSTs, then you can only stripe a file across 2.

Or maybe I don't understand your terminology. I don't know what you mean by "0,4" and "0,2".
Oh, I mean there is one file of, for example, 6 MB; the stripe size is 1 MB and there are only 2 OSTs. The file will then be divided into 6 stripes, denoted stripes 0, 1, 2, 3, 4, 5. The distribution over the 2 OSTs would be stripes 0, 2, 4 on OST0 and stripes 1, 3, 5 on OST1.

Jaln
Christopher J. Morrone
2013-Jun-14 00:23 UTC
Re: how the lustre distribute data among disks within one OST
In that case, it is the question part that I do not understand. :) What is "stripe 0,4", and why could it be "closer" than "stripe 0,2"? In your example, 0, 2, and 4 are all in the same place.

If your file is striped over 2 OSTs, then essentially what happens behind the scenes is that there are two files, one on each OST. But Lustre hides that from you as a user. Lustre basically does modulo operations to translate an offset in the file that it presents to the user into which OST to use and which offset within that OST's file.

Does that help at all?

Chris
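[Editor's illustration] The modulo translation described above can be sketched roughly as follows. This is a simplified RAID-0 style mapping written for this thread, not Lustre's actual code; the function name `locate` and its parameters are made up for the example, and it assumes a fixed stripe size and stripe count for the whole file.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in the user-visible file to
    (stripe index, byte offset inside that stripe's object)."""
    chunk = offset // stripe_size            # which stripe_size-sized chunk of the file
    stripe_index = chunk % stripe_count      # which object (one per OST) holds that chunk
    chunk_in_object = chunk // stripe_count  # how many earlier chunks that object holds
    object_offset = chunk_in_object * stripe_size + offset % stripe_size
    return stripe_index, object_offset


MiB = 1 << 20
# The 6 MB file from this thread: 1 MB stripe size, striped over 2 OSTs.
for chunk in range(6):
    index, obj_off = locate(chunk * MiB, MiB, 2)
    print(f"file MB {chunk} -> object {index}, object offset {obj_off // MiB} MB")
```

Under these assumptions, file megabytes 0, 2, and 4 all land in the same object, at object offsets 0, 1, and 2 MB, which is why they are "all in the same place" from Lustre's point of view.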
Thank you Chris, it is clearer now. In my question, "stripe 0,4" means one process wants to access stripes 0 and 4 at the same time, while another process wants to access both stripes 0 and 2. Even though stripes 0, 2, and 4 are in the same place (one file), their offsets are different: 0 and 2 are contiguous, while from 0 to 4 there is a gap. So my concern is, will the two processes have different I/O costs? In other words, would accessing 0 and 4 take longer than accessing 0 and 2?

Jaln
Christopher J. Morrone
2013-Jun-14 01:01 UTC
Re: how the lustre distribute data among disks within one OST
Well, that is really more of a question for the backend filesystem in that case. From Lustre's perspective there is very little difference, especially since our RPC size is currently capped at 1MB (although that may change in future versions). It would probably make more difference to the backend filesystem and storage devices than to Lustre itself.

But, of course, the devil is always in the details.

Chris
Dilger, Andreas
2013-Jun-14 21:47 UTC
Re: how the lustre distribute data among disks within one OST
On 2013/13/06 6:36 PM, "Jaln" <valiantljk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Thank you Chris, I'm sort of clear now.
> In my question, stripe 0,4 means one process wants to access stripe 0 and
> 4 at the same time.
> there is another process wants to access both stripe 0 and 2,

Just to clarify the Lustre terminology here: if there are only 2 OSTs involved, there will only be two stripes, with index "0" and "1" (each with an arbitrary object ID), one on each OST. In your case, each one will be an object 3MB in size.

> even though stripe 0, 2, 4 are in the same place (one file),
> but their offsets are different, i.e., 0 and 2 are contiguous,
> while from 0 to 4 there is a gap.

Right, this is no different from an application reading megabytes 0,1 or 0,2 from a local disk filesystem. There will be a seek in the middle, unless the client, OSS, or RAID/disk decides to do readahead on the file or object. If the file is <= 2MB in size (llite.*.max_read_ahead_whole_mb tunable), Lustre will just prefetch the whole file on first access.

> So my concern is, will the two processes have different I/O cost?
> In other words, accessing 0 and 4 would take longer time than accessing 0
> and 2.

Sure, there is one seek per MB accessed (<= 10ms), but this is comparable to the network transfer time (10ms per MB for 1GigE, 1ms per MB for 10GigE), and this can all be pipelined by the client, which sends up to 8 RPCs concurrently for each OST.

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
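[Editor's illustration] As a rough check on the figures above, the sketch below computes the ideal wire time for 1 MB at each link speed and prints it next to the 10 ms seek bound. The helper `wire_time_ms` and the constant `SEEK_MS` are names invented for this example, and the calculation ignores protocol overhead, so the results come out somewhat below the rounded "10ms" and "1ms" rules of thumb.

```python
def wire_time_ms(size_bytes, link_bits_per_s):
    """Ideal time to push size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_s * 1000


MiB = 1 << 20
SEEK_MS = 10.0  # upper bound on one disk seek, as quoted in the message above

for name, rate in (("1GigE", 1e9), ("10GigE", 10e9)):
    transfer = wire_time_ms(MiB, rate)
    print(f"{name}: {transfer:.1f} ms per MB on the wire, seek <= {SEEK_MS:.0f} ms")
```

On 1GigE the seek and the transfer are of the same order, so an extra seek between non-adjacent megabytes is not the dominant cost; on faster links the seek matters relatively more.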
Hi Andreas, Thanks a lot,>this can all be pipelined by the client, which sends up to 8 RPCs >concurrently >for each OST.Can you plz explain a little bit about why "this can all be pipelined by the client" how does the client pipeline it? do you mean pipeline the multiple processes? Thanks, Jaln ________________ Jialin Liu Ph.D. student TTU&LBNL http://www.myweb.ttu.edu/jialliu/ On Fri, Jun 14, 2013 at 2:47 PM, Dilger, Andreas <andreas.dilger-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>wrote:> On 2013/13/06 6:36 PM, "Jaln" <valiantljk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > >Thank you Chris, I''m sort of clear now. > >In my question, stripe 0,4 means one process wants to access stripe 0 and > >4 at the same time. > >there is another process wants to access both stripe 0 and 2, > > Just to clarify the Lustre terminology here, if there are only 2 OSTs > involved, > there will only be two stripes, with index "0" and "1" (each with an > arbitrary > object ID), one on each OST. In your case, each one will be an object of > 3MB > in size. > > >even though stripe 0, 2, 4 are in the same place (one file), > >but their offsets are different, i.e., 0 and 2 are contiguous, > >while from 0 to 4 there is a gap. > > Right, this is no different than an application reading from megabytes 0,1 > or 0,2 > from a local disk filesystem. There will be a seek in the middle, unless > the > client, OSS, or RAID/disk decide to do readahead on the file or object. > If the > file is <= 2MB in size (llite.*.max_read_ahead_whole_mb tunable), Lustre > will just prefetch the whole file on first access. > > >So my concern is, will the two processes have different I/O cost? > >In other words, accessing 0 and 4 would take longer time than accessing 0 > >and 2. 
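Editor's sketch: the modulo translation Chris describes in this thread can be written out in a few lines of Python. This is only an illustration of plain round-robin (RAID-0 style) striping, using the thread's example of a 6 MB file, a 1 MB stripe size, and 2 OSTs. The function name map_offset and its parameters are made up for this example and are not Lustre APIs.

```python
MIB = 1 << 20  # one mebibyte, used as the stripe size in the thread's example


def map_offset(file_offset, stripe_size=MIB, stripe_count=2):
    """Translate a byte offset in the user-visible file into
    (stripe_index, object_offset), assuming simple round-robin striping.

    stripe_index selects the per-OST object (0 or 1 when there are 2 OSTs);
    object_offset is the byte offset inside that object.
    """
    chunk = file_offset // stripe_size       # which stripe_size-sized chunk of the file
    stripe_index = chunk % stripe_count      # round-robin choice of object
    object_offset = (chunk // stripe_count) * stripe_size + file_offset % stripe_size
    return stripe_index, object_offset


if __name__ == "__main__":
    # Walk the six 1 MB chunks of the 6 MB example file.
    for chunk in range(6):
        index, offset = map_offset(chunk * MIB)
        print(f"file MB {chunk} -> object {index}, object offset {offset // MIB} MB")
```

Under these assumptions, file megabytes 0, 2, and 4 land at offsets 0, 1, and 2 MB of the first object, and megabytes 1, 3, and 5 land at the same offsets of the second object, so each object is 3 MB. Reading file MB 0 and 2 touches adjacent ranges of one object, while reading file MB 0 and 4 leaves a 1 MB gap inside that object.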
Dilger, Andreas
2013-Jun-17 23:18 UTC
Re: how the lustre distribute data among disks within one OST
On 2013/16/06 12:02 AM, "Jaln" <valiantljk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>Hi Andreas,
>Thanks a lot,
>>this can all be pipelined by the client, which sends up to 8 RPCs
>>concurrently for each OST.
>
>Can you plz explain a little bit about why "this can all be pipelined by
>the client"
>how does the client pipeline it?
>do you mean pipeline the multiple processes?

The RPC service on the client node will send up to 8 write RPCs asynchronously before blocking and waiting for a reply. This allows even single-threaded applications to get reasonable IO performance, though multiple userspace threads on the client can still do better. The reason is that copy_from_user() in the kernel becomes CPU-bound copying the data from userspace into the kernel buffers.

Using O_DIRECT avoids this data copy, but it introduces a separate issue: O_DIRECT requires that data not be buffered, which Lustre takes to mean "sync'd to disk on the server", so that the data is safe in the face of a crash of either client or server. As a result, O_DIRECT is not faster unless the client does very large writes.

Cheers, Andreas

>On Fri, Jun 14, 2013 at 2:47 PM, Dilger, Andreas
><andreas.dilger-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
>On 2013/13/06 6:36 PM, "Jaln" <valiantljk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>>Thank you Chris, I'm sort of clear now.
>>In my question, stripe 0,4 means one process wants to access stripe 0 and
>>4 at the same time.
>>there is another process wants to access both stripe 0 and 2,
>
>Just to clarify the Lustre terminology here, if there are only 2 OSTs
>involved, there will only be two stripes, with index "0" and "1" (each
>with an arbitrary object ID), one on each OST. In your case, each one
>will be an object of 3MB in size.
>
>>even though stripe 0, 2, 4 are in the same place (one file),
>>but their offsets are different, i.e., 0 and 2 are contiguous,
>>while from 0 to 4 there is a gap.
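Editor's sketch: the behaviour Andreas describes in his reply, where the client keeps up to 8 RPCs outstanding and only blocks once that window is full, can be imitated with a bounded window of asynchronous requests. The Python below is a toy model and not Lustre code. The names fake_write_rpc and write_chunks, the 1 ms sleep standing in for network and disk time, and the stats dictionary are all assumptions made for illustration. Only the window size of 8 comes from the thread.

```python
import asyncio

MAX_RPCS_IN_FLIGHT = 8  # per-OST figure quoted in the thread


async def fake_write_rpc(stats):
    """Stand-in for one write RPC; the sleep replaces real network and disk time."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.001)
    stats["in_flight"] -= 1
    stats["done"] += 1


async def write_chunks(n_chunks, window=MAX_RPCS_IN_FLIGHT):
    """Issue n_chunks fake RPCs, never allowing more than `window` outstanding."""
    stats = {"in_flight": 0, "peak": 0, "done": 0}
    slots = asyncio.Semaphore(window)

    async def send():
        try:
            await fake_write_rpc(stats)
        finally:
            slots.release()  # a reply arrived, so one slot is free again

    tasks = []
    for _ in range(n_chunks):
        # The single "application thread" blocks here only when the window is full.
        await slots.acquire()
        tasks.append(asyncio.create_task(send()))
    await asyncio.gather(*tasks)
    return stats


if __name__ == "__main__":
    print(asyncio.run(write_chunks(32)))
```

The point of the sketch is that one sender can keep several requests moving at once without using multiple processes: it hands off a request and immediately prepares the next one, and it waits only when 8 are already outstanding. It does not model the copy_from_user() cost or the O_DIRECT behaviour mentioned in the reply.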