Good day,

some time ago we discussed that it would be very helpful to store the epoch in the inode on the MDS. The perfect solution would be to store the epoch in the old inode body, but there is not much space left in the body, and with DMU we'll have this problem again.

Given that the minimal inode size we use on the MDS is 512 bytes, we can store up to 13 stripes in the body; larger EAs go to a dedicated block. If we add an 8-byte epoch, we can store up to 12 stripes in the body. So an epoch stored in an EA affects only files with exactly 13 stripes. Files with any other stripe count are not affected at all.

A couple of lesser concerns are:
1) CPU usage
2) epoch on an old filesystem with insufficient inode space

Any objections to using an EA to store the SOM epoch?

thanks, Alex
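A minimal sketch of the stripe arithmetic in the message above. The byte counts (`EA_SPACE`, `EA_ENTRY_OVERHEAD`, `LOV_HEADER`, `PER_STRIPE`) are assumptions chosen so that the model reproduces the 13 and 12 stripe figures quoted in the message; they are not taken from the ldiskfs or Lustre headers.

```python
# Illustrative model of in-inode EA space; every constant is an assumption.
EA_SPACE = 380           # bytes assumed free for EAs in a 512-byte inode
EA_ENTRY_OVERHEAD = 20   # assumed per-EA cost (entry header plus name)
LOV_HEADER = 32          # assumed fixed part of the striping EA
PER_STRIPE = 24          # assumed bytes per stripe entry
EPOCH_SIZE = 8           # epoch size stated in the message


def max_stripes_in_inode(with_epoch: bool) -> int:
    """Largest stripe count whose striping EA still fits inside the inode."""
    space = EA_SPACE
    if with_epoch:
        # The epoch is modelled as its own small EA that is placed first.
        space -= EA_ENTRY_OVERHEAD + EPOCH_SIZE
    space -= EA_ENTRY_OVERHEAD + LOV_HEADER
    return max(space // PER_STRIPE, 0)


def lov_spills(stripes: int, with_epoch: bool) -> bool:
    """True when the striping EA has to go to a separate block."""
    return stripes > max_stripes_in_inode(with_epoch)


def affected_stripe_counts(limit: int = 32) -> list:
    """Stripe counts whose placement changes once the epoch EA is added."""
    return [n for n in range(1, limit + 1)
            if lov_spills(n, True) != lov_spills(n, False)]
```

Under these assumed sizes, only one stripe count changes from "fits in the inode" to "needs a separate block", which is the point the message makes about files with exactly 13 stripes.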
Alex Zhuravlev wrote:
> any objections to use EA to store SOM epoch?

Hi!

Can we use IAM for storing the epoch? It is fast and does not have such strong size limitations. We could create an "epoch" index at mkfs time (as is done for the existing indexes now) and use the object's fid as the key and the epoch as the value.

Thanks.
-- umka
Hmm, not sure I got it. The epoch is per-inode, and we don't need more than one epoch for any inode.

thanks, Alex

Yuriy Umanets wrote:
> Can we use IAM for storing epoch? It is fast and does not have such
> strong size limitations. We could make "epoch" index in mkfs time (like
> it is done for existing indexes now) and use object's fid as a key and
> epoch as value.
>
> Thanks.
Alex Zhuravlev wrote:
> hmm. not sure I got it. epoch is per-inode. and we don't need >1 epoch
> for any inode.

Yes, right. We will not have several epochs per inode. I think we need Nikita here, as he is the author of IAM and may help us.

In HEAD we have the OI (Object Index), whose purpose is to map object fids to object store cookies (inode + generation). The fid is the key and the inode store info is the value. We have only one such mapping entry for any inode. I proposed a similar mapping that stores the SOM epoch for the inode in the same way: fid as key, epoch as value.

Nikita, is this a correct use of IAM?

Thanks.
-- umka
> hi!
>
> Can we use IAM for storing epoch? It is fast and does not have such
> strong size limitations.

There are no size limitations: an EA can be stored in a separate block. We just want to minimize IO.

> We could make "epoch" index in mkfs time (like
> it is done for existing indexes now) and use object's fid as a key and
> epoch as value.

This looks like it will double the IO/seeks for each inode.

-- Vitaly
Vitaly Fertman wrote:
> there are no size limitations, EA can be stored in a separate
> block, we just want to minimize IO.

An EA in a separate block is evil. It makes things slow.

> this looks like it will double IO/seeks for each inode.

Well, it did not in cmd3 :)

-- umka
Yuriy Umanets wrote:
> EA is separate block is evil. It makes things slow.

We have had fast EAs (stored in the inode, which is why we make inodes large) for years.

> Well, it did not in cmd3 :)

If it isn't stored in the inode, it's a seek.

thanks, Alex
Alex Zhuravlev wrote:
> we have fast EAs (stored in inode, this is why we make them large) for years.

Well, people used horses for ages, but this did not stop them from building cars :) Guys, I gave you an idea that is no worse than using EAs. I will not insist it is great. If you can't estimate its value yourselves, well, let it be. We have such a nice thing as IAM and you keep talking about EAs...

Seriously, IMHO here is what is bad about EAs:

1. You need to control their size; you need to bother.
2. Large fast inodes make create/lookup slow. You need to load this thing into memory after all. I think this is the counterpart of the additional seeks caused by IAM.
3. Storing the epoch in an EA makes you use this chain to access it: fid->inode->epoch (in EA). IAM makes it shorter: fid->epoch (in IAM).
4. Large inodes consume more RAM.
5. There are others... but they are less related to technical downsides/advantages, so I will omit them.

Thanks.
-- umka
I guess there is some sort of misunderstanding here.

We don't need a fid->epoch mapping. We only need the epoch along with the other inode attributes. The epoch is fixed size (8 bytes, probably a few more for flags in the future).

thanks, Alex

Yuriy Umanets wrote:
> 3. Storing epoch in EA makes you use this chain to access epoch:
> fid->inode->epoch (in EA), IAM makes it shorter: fid->epoch (in IAM);
btw, are you proposing to store LOV in a global IAM?

thanks, Alex

Yuriy Umanets wrote:
> 3. Storing epoch in EA makes you use this chain to access epoch:
> fid->inode->epoch (in EA), IAM makes it shorter: fid->epoch (in IAM);
Alex Zhuravlev wrote:
> I guess there is some sort of misunderstanding here.
>
> we don't need fid->epoch mapping. we only need epoch along with other
> inode attributes. epoch is fixed size (8 bytes, probably few more for
> flags in future)

Alex,

Yes, this is what I understand as well. We were discussing that the EA approach has some downsides. What you propose, storing it in an EA, is logical given that the epoch is a kind of extension to the inode fields. It is a property of the inode object, so it is logical to store it with the inode; I see your point. But as we saw, this has (or may have) some downsides that may be solved with IAM. Just keep this in mind when you think about and work on it. I do not see why IAM is such a bad fit here.

Thanks.

> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

-- umka
Alex Zhuravlev wrote:
> btw, are you proposing to store LOV in global IAM?

By "LOV" you mean the LOV EA? If yes, well, this seems too radical an idea, but it may be worse to think on. After all, using IAM for it will cost almost nothing in terms of additional development. IAM should be ready for that.

Nikita, are there any limitations on value size in IAM?

Thanks.
-- umka
Yuriy Umanets writes:
 > Alex Zhuravlev wrote:
 > > btw, are you proposing to store LOV in global IAM?
 >
 > Nikita, is there any limitations for value size in IAM?

The htree shift code will be upset if key+value is larger than one fourth of a block, but that's easy to fix.

Nikita.
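The limit Nikita mentions can be written as a one-line check. This is a sketch only: the 4096-byte block size, the 16-byte fid key, and the striping EA size formula are assumptions for illustration, and `fits_iam_record` is a hypothetical helper, not an IAM function.

```python
def fits_iam_record(key_len: int, value_len: int, block_size: int = 4096) -> bool:
    """True when key+value stay within one fourth of a block."""
    return key_len + value_len <= block_size // 4


FID_KEY = 16     # assumed key size in bytes
EPOCH_VALUE = 8  # epoch size from the thread


def lov_ea_size(stripes: int) -> int:
    """Assumed striping EA size: fixed header plus a per-stripe entry."""
    return 32 + 24 * stripes
```

With these assumed sizes an 8-byte epoch is far below the limit, while a widely striped LOV EA used as an index value would exceed it.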
Yuriy Umanets wrote:
> by "LOV" you mean LOV EA? If yes, well, this is too radical idea seems,
> but it may be worse to think on. Finally using IAM with it will cost

s/worse/valuable/

-- umka
Yuriy Umanets wrote:
> Finally using IAM with it will cost
> almost nothing in meaning of additional development. IAM should be ready
> for that.

It will cost an additional seek to access something through IAM. The same applies to LOV and to the epoch.

thanks, Alex
Nikita Danilov wrote:
> Htree shift code will be upset if key+value are larger than one fourth
> of a block, but that's easy to fix.

This is in fact an interesting idea. An object (inode + EA, etc.) always accumulates more and more information as new features are added, and it seems that one day we will need to get rid of the EA because it is too big.

Thanks.
-- umka
Yuriy Umanets wrote:
> But as we saw, this has/may have some
> downsides which may be solved with IAM. Just take this in mind when you
> think/work on it. I do not see why IAM is such a bad here.

1) additional seek(s)
2) shared structure (additional cost on concurrent access)
3) inode is already 512 bytes

thanks, Alex
On Tue, 19 Feb 2008 15:02:02 +0300, Yuriy Umanets <Yury.Umanets at Sun.COM> wrote:
> 2. Large-fast inodes make create/lookup slow. You need to load this
> thing to memory after all. I think this is complement to additional
> seeks caused by IAM;

But this is still better than an extra block for an EA or IAM. Btw, IAM data is also held in memory, and possibly takes no less of it than the extra inode size.

> 3. Storing epoch in EA makes you use this chain to access epoch:
> fid->inode->epoch (in EA), IAM makes it shorter: fid->epoch (in IAM);

Not true, actually. The inode will be read anyway unless you are proposing to put the whole inode body in IAM, so there is no benefit. Moreover, inode->EA is a direct mapping, while fid->epoch needs an index lookup and may require reading several blocks if the IAM is large, and it will be large in this case. So the IO will be no better than even an EA in an extra block.

> 4. Large inodes consume more RAM;

This is the same as 2.

Guys, don't forget about DMU as well.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc
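The comparison in the message above can be put into a rough cold-cache model. This is a sketch under stated assumptions: the function names, the fanout of 128 entries per index block, and the object count are all illustrative, and the model ignores caching entirely (in practice the upper levels of an index are often already in memory).

```python
def index_lookup_reads(entries: int, entries_per_block: int) -> int:
    """Blocks touched by one lookup in a tree index, with nothing cached."""
    depth = 1
    capacity = entries_per_block
    while capacity < entries:
        capacity *= entries_per_block
        depth += 1
    return depth


def reads_to_get_epoch(layout: str, index_entries: int = 10_000_000,
                       entries_per_block: int = 128) -> int:
    """Block reads needed to obtain inode attributes plus the epoch."""
    inode_read = 1  # the inode is read anyway for the other attributes
    if layout == "in_inode_ea":
        return inode_read                      # epoch arrives with the inode
    if layout == "external_ea_block":
        return inode_read + 1                  # one extra block for the EA
    if layout == "iam_index":
        return inode_read + index_lookup_reads(index_entries, entries_per_block)
    raise ValueError(f"unknown layout: {layout}")
```

Under these assumptions the separate index never reads fewer blocks than an EA in an external block, which matches the argument that the fid->epoch chain is not actually shorter once the inode read is counted.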
Alex Zhuravlev wrote:
> 1) additional seek(s)
> 2) shared structure (additional cost on concurrent access)
> 3) inode is already 512 bytes

Agreed, but none of this has been measured, and it may turn out that IAM is no worse and is handier in many respects.

Thanks.
-- umka
On Tue, 2008-02-19 at 17:59 +0300, Mikhail Pershin wrote:
> Guys, don't forget about DMU as well.

For the DMU, we will be using 1024-byte dnodes by default to store the striping information, so the epoch can be stored in the in-dnode system attributes.

The epoch will need to be stored in an external block or FatZap (depending on the implementation of in-dnode EAs) only in case the file is striped across more than 10-15 OSTs. (The exact number of stripes will again depend on the design of in-dnode EAs.)

Thanks,
Kalpak.
Mikhail Pershin wrote:
> but this is still better than extra block for EA or IAM. Btw IAM data is
> also in memory and takes it no less than extra inode size possibly

If it is in memory, it will generate fewer seeks :-)

> not true actually. inode will be read anyway until you are proposing to
> put whole inode body in IAM, so there is no benefits. Moreover inode->ea
> is direct mapping while fid->epoch will need index lookup and may invoke
> several blocks to read if IAM is large and it will be large in this case,
> so IO will be not better than even EA in extra block.

I did not mean to put the whole inode in IAM. I meant only to put the fid there as key and the epoch as value. So the way to access the epoch is shorter with IAM, as there is no need to load the inode. But all of this needs to be well thought out, as you all mention more seeks, new reads, etc.

-- umka
On Tue, 2008-02-19 at 17:59 +0300, Mikhail Pershin wrote:
> Guys, don't forget about DMU as well.

For the DMU, we haven't yet reached a consensus with the ZFS team on a final design for EAs in the dnode. The ZFS team proposed having variably-sized system attributes (with integer indexes) instead of name-value attributes like ext3.

I guess this is another good point to discuss in today's ZFS team meeting.

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
Yuriy Umanets wrote:
> I did not mean to put whole inode in IAM. I meant only put there fid as
> key and epoch as value. So way to access epoch is shorter with IAM as no
> need to load inode. But these all need to be well thought as all your
> mention more seeks, new reads, etc.

I don't understand the benefits of this approach. The idea is to pack frequently accessed data together so that we don't need additional seeks and can load/store this data with a single contiguous IO.

thanks, Alex
On Tue, 2008-02-19 at 20:41 +0530, Kalpak Shah wrote:
> The epoch will need to be stored in an external block or
> FatZap (depending on implementation of in-dnode EAs) only in-case the
> file is striped across more than 10-15 OSTs.

That may not be true. For example, with Matthew's proposed design, we could store the epoch as a system attribute with "higher priority" (lower index) than the LOV data, which means it would always fit in the dnode even if we have lots of LOVs.

Regards,
Ricardo
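A toy model of the "lower index is placed first" idea from the message above. The in-dnode attribute design was still under discussion at this point in the thread, so everything here is an assumption: `place_attributes` is a hypothetical helper, the greedy fill order is a guess at what "higher priority" means, and the sizes are invented.

```python
def place_attributes(attrs, dnode_space):
    """Greedily place attributes in ascending index order.

    attrs is a list of (index, name, size) tuples; a lower index is placed
    first. Returns (names kept in the dnode, names spilled elsewhere).
    """
    in_dnode, spilled = [], []
    free = dnode_space
    for _index, name, size in sorted(attrs):
        if size <= free:
            in_dnode.append(name)
            free -= size
        else:
            spilled.append(name)
    return in_dnode, spilled


# Assumed sizes: 512 bytes of attribute space, a striping EA that fills it
# exactly, and an 8-byte epoch.
epoch_first = place_attributes([(0, "epoch", 8), (1, "lov", 512)], 512)
lov_first = place_attributes([(0, "lov", 512), (1, "epoch", 8)], 512)
```

In this model, giving the epoch the lower index keeps it in the dnode and pushes the large striping data out, while the opposite ordering spills the epoch instead.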
> I did not mean to put whole inode in IAM. I meant only put there fid as
> key and epoch as value. So way to access epoch is shorter with IAM as no
> need to load inode. But these all need to be well thought as all your
> mention more seeks, new reads, etc.

As Alex already mentioned, we do not need a fid->ioepoch mapping. The ioepoch is a tag for the inode attributes, and they all need to be loaded together, i.e. the inode needs to be loaded anyway.

-- Vitaly
Rather than discussing this one EA at a time, should we not consider any other EAs (e.g. those being considered in current architecture work) that might contend for space in the inode?

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org
> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of
> Alex Zhuravlev
> Sent: 19 February 2008 9:49 AM
> To: lustre-devel at lists.lustre.org
> Subject: [Lustre-devel] storing SOM epoch in EA
>
> any objections to use EA to store SOM epoch?
>
> thanks, Alex
Well, this was one of the reasons I asked lustre-devel@ for input: what else do we consider important to store in the inode? Probably we should list all existing and planned EAs and rank them.

Also, I've come to think that in some cases we don't need to load the LOV EA at all, for example for getattr in the case of SOM (size/blocks are cached on the MDS), or for revalidation when the client already has the LOV EA.

thanks, Alex

Eric Barton wrote:
> Rather than discussing this one EA at a time, should we not
> consider any other EAs (e.g. being considered in current
> architecture work) that might contend for space in the inode?
Alex Zhuravlev writes:
 > > Well, it did not in cmd3 :)
 >
 > if it isn't stored in inode, it's a seek.

One possible point here is that the OSD has to do the fid->ino translation anyway, so it makes sense to use the same index to store other information besides the inode number. That is, we can use the "object index" IAM file to map a fid to (ino, gen, ioepoch, LOV, ...) records, and this would not cause any additional seeks. The downside is that the object index is so heavily used that making its records larger is going to increase the amount of IO significantly, so it is only worth placing there things that we are absolutely sure will be needed for every inode.

Nikita.
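The trade-off described above, larger object index records meaning more index blocks to read and write, can be sketched numerically. The block size and the key and value sizes are assumptions for illustration, and the model ignores per-block headers and interior index nodes.

```python
BLOCK_SIZE = 4096  # assumed index block size in bytes


def records_per_block(key_size: int, value_size: int,
                      block_size: int = BLOCK_SIZE) -> int:
    """How many fixed-size records fit in one leaf block."""
    return block_size // (key_size + value_size)


def leaf_blocks(n_objects: int, key_size: int, value_size: int) -> int:
    """Leaf blocks needed to index n_objects (ceiling division)."""
    per_block = records_per_block(key_size, value_size)
    return -(-n_objects // per_block)


FID = 16              # assumed fid key size
INO_GEN = 8           # assumed (ino, gen) value size
IOEPOCH = 8           # epoch size from the thread
LOV_13 = 32 + 24 * 13  # assumed striping EA size for 13 stripes
```

With these assumed sizes, adding the epoch to each record grows the leaf level by roughly a third, and adding striping data grows it by more than an order of magnitude, which illustrates why only data needed for every inode is worth placing in such an index.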
Nikita Danilov wrote:
> That is, we can use "object index" iam
> file to map fid into (ino, gen, ioepoch, LOV, ...) records, and this
> would not cause any additional seeks.

Well, we also need to update the ioepoch when we update size/blocks.

thanks, Alex
On Feb 19, 2008 16:30 +0200, Yuriy Umanets wrote:
> Finally using IAM with it will cost
> almost nothing in meaning of additional development. IAM should be ready
> for that.
>
> Nikita, is there any limitations for value size in IAM?

One of the major problems with IAM is that e2fsck doesn't work with it. IAM will only exist for ldiskfs (though ZAP works for DMU), and there is a consistency issue between items stored in IAM and those in the rest of the filesystem. If e2fsck deletes an inode, it will not delete the entry in IAM, so we would have to patch e2fsck to understand not only IAM, but also the specific uses of IAM that link items there to inodes in another place.

I don't think that introducing a dependence on IAM is practical.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
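The consistency problem described above amounts to a cross-check between the index and the set of live inodes. The sketch below only illustrates that check with plain Python containers; it is not e2fsck code, and `stale_index_entries` is a hypothetical helper.

```python
def stale_index_entries(index: dict, live_inodes: set) -> list:
    """Return fids whose index record points at an inode that no longer exists."""
    return sorted(fid for fid, ino in index.items() if ino not in live_inodes)


# A checker that removes inode 12 without knowing about the index leaves a
# dangling fid -> inode record behind.
index = {"fid-a": 11, "fid-b": 12, "fid-c": 13}
live_after_fsck = {11, 13}
dangling = stale_index_entries(index, live_after_fsck)
```

A filesystem checker would need this kind of knowledge about every index that refers to inodes, which is the extra patching burden the message points out.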
On Feb 19, 2008 15:31 +0000, Eric Barton wrote:
> Rather than discussing this one EA at a time, should we not
> consider any other EAs (e.g. being considered in current
> architecture work) that might contend for space in the inode?

I wouldn't object to this. There were several other proposals to add EAs to the inode, but individually the overhead is high. A single aggregated EA struct for Lustre would be more reasonable.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Vitaly Fertman
2008-Feb-19 20:33 UTC
[Lustre-devel] on-disk SOM attributes [former storing SOM epoch in EA]
Hi All,

Besides the question Alex asked, there are some more issues I would like to discuss, so let me list all of them here.

1) Where to store the on-disk IOepoch on the MDS? This question was described in Alex's initial "storing SOM epoch in EA" email, so I will not repeat it here.

2) Where to store the SOM-ENABLE flag in the inode? Currently it is stored in the inode flags, but that may not be acceptable for DMU. If so, we will probably need to move it to the place where we store the on-disk IOepoch (an EA?). I also want to mention that the on-disk IOepoch is needed only at attribute update time, to be sure we write newer attributes, whereas the SOM-ENABLE flag is needed more often: it is also checked when we tell a client that the size is valid at getattr.

3) How to avoid e2fsck zeroing i_blocks on the MDS? We could patch e2fsck, or alternatively store a copy of i_blocks in the inode somewhere fsck does not know about, e.g. in the same EA. As i_blocks is needed on each getattr, it is worth storing it along with the SOM-ENABLE flag.

Please advise.

-- Vitaly
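One way to picture the three items above is a single small EA that carries the IOepoch, the SOM-ENABLE flag, and a copy of i_blocks. The layout below is purely hypothetical (field order, widths, the `SOM_ENABLED` bit, and the helper names are all assumptions), and treating an update with an equal epoch as acceptable is also an assumption, since the message only says that newer attributes must win.

```python
import struct

# Hypothetical little-endian layout: ioepoch (u64), flags (u32), blocks (u64).
SOM_EA = struct.Struct("<QIQ")
SOM_ENABLED = 0x1  # hypothetical flag bit


def pack_som_ea(ioepoch: int, flags: int, blocks: int) -> bytes:
    return SOM_EA.pack(ioepoch, flags, blocks)


def unpack_som_ea(raw: bytes) -> tuple:
    return SOM_EA.unpack(raw)


def apply_update(raw: bytes, new_epoch: int, new_blocks: int) -> bytes:
    """Write new attributes only if they do not come from an older epoch."""
    ioepoch, flags, _blocks = unpack_som_ea(raw)
    if new_epoch < ioepoch:
        return raw  # stale update: keep the newer attributes already stored
    return pack_som_ea(new_epoch, flags, new_blocks)


def size_is_valid(raw: bytes) -> bool:
    """The getattr-time check: is SOM enabled for this inode?"""
    _ioepoch, flags, _blocks = unpack_som_ea(raw)
    return bool(flags & SOM_ENABLED)
```

Keeping all three fields in one record means a getattr can read the flag and the block count together, and an attribute update can compare epochs without touching a second structure.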
Alex Zhuravlev writes:
 > well, we also need to update ioepoch when we update size/blocks.

Indeed. We can still do this for read-only and read-mostly attributes, like LOV, and avoid dirtying extra blocks in the common case.

Nikita.