cc-ing Lustre-devel - this is of general interest.

> -----Original Message-----
> From: Zhen.Liang at Sun.COM [mailto:Zhen.Liang at Sun.COM]
> Sent: 04 May 2009 1:50 PM
> To: Eric Barton; Nathan.Rutman at Sun.COM
> Cc: 'Robert Read'
> Subject: Re: AT and ptlrpc SMP stuff
>
> Nathan, Eric,
>
> I actually had a discussion with Eric last week; it seems there could be
> some problems:
>
> 1. Buffer stealing or CPU load balancing
>    a) Buffer stealing
>    The new LNet always tries to get a buffer from the current CPU to match a
> request, but if all buffers on the current CPU are exhausted it will
> steal a buffer from other CPUs, and it will wake up service threads on
> other CPUs to handle the request as well. So it's possible that RPCs from
> the same client are handled by different CPUs on the server.
>    b) CPU load balancing
>    RPCs can be dispatched to other CPUs by the current CPU. I don't
> know whether it's necessary to have this (we may benefit very little or
> not at all from bouncing RPCs between CPUs); if we do have CPU load
> balancing, then requests can be handled by any CPU.
>
> 2. Client connected to a lot of routers
>    If a client is connected to more than one router, then requests can be
> forwarded by any one of those routers. On the server side, requests from
> different routers are very likely to be received by different LND
> threads, which wake up different ptlrpc service threads on different CPUs,
> so there is no CPU binding for clients in this case.
>
> 3. The last one is more about LNet:
>    Eric, we actually talked about this a bit in East Palo Alto. Based on the
> current design, if there is only one router, then all requests will be
> received by the same LND thread and delivered to ptlrpc service threads
> on one CPU, so we will only use one or two CPUs on the server. Even worse,
> on the router all messages are serialized on the peer structure of the
> server, so we still have very high contention.

[For people reading on lustre-devel, this refers to the improved SMP scaling
work Liang is doing.]

The problem for a server "hiding" behind a single router centers on the
handling of traffic on the link between them - as you mention in another
mail, that's not a problem for the upper layers in the stack since they can
distribute the work over CPUs by hashing on the end-to-end peer NID (see the
sketch after this message). But I agree that at the lower levels we need
multiple connections, each with separate CPU affinity, to avoid contention
and ensure SMP scaling.

> The only way I can find to resolve this problem is creating multiple
> LNet networks between router and server, so both router and server can
> have multiple peers & connections for the remote side. Actually, I think
> it's fair for the router/server to take more credits, buffers, and CPUs on
> the server/router than other clients do. We have two options to achieve this:
>    a) Static configuration by the user; then we don't need to change
> anything, but it will increase the complexity of network configuration,
> and some users may feel confused.
>    b) LNet can create sub-networks for the network between router and
> server (we can make it tunable), and requests will be balanced across the
> different sub-networks. We can make it almost transparent to the user. It
> seems doable to me, but I haven't estimated how much effort we would need.

The more transparent, the better. If _all_ configuration could be avoided,
then so much the better. Multiple connections to the same immediate
physical peer at the LND level allow maximum SMP concurrency, but this is
best detected/managed in the generic LNET code.
This seems to beg for adding explicit connection handling to the LND API
and (as we've known forever) would remove a lot of duplication between
LNDs. However, I think all we do right now is size the work and leave it
pending. We _can_ achieve the same effect with explicit configuration, and
the most important use case right now is the MDS, which should be amply
provisioned with routers where it matters.

    Cheers,
              Eric

> Any suggestion?
>
> Thanks
> Liang
>
>
> Eric Barton wrote:
> > Nathan,
> >
> > Please talk me through these issues.
> >
> >     Cheers,
> >               Eric
> >
> >> -----Original Message-----
> >> From: Nathan.Rutman at Sun.COM [mailto:Nathan.Rutman at Sun.COM]
> >> Sent: 01 May 2009 12:45 AM
> >> To: Liang Zhen
> >> Cc: Eric Barton; Robert Read
> >> Subject: Re: AT and ptlrpc SMP stuff
> >>
> >> Liang Zhen wrote:
> >>
> >>> Nathan,
> >>>
> >>> Yes, I don't know whether eeb has sent you the patch or not, so I have
> >>> put it in the attachment.
> >>>
> >>> Basically, I move some members from ptlrpc_service to per-CPU data,
> >>> and make service threads CPU-affine by default, in order to get rid of
> >>> any possible global lock contention on the RPC handling path, i.e.
> >>> ptlrpc_service::srv_lock. As you know, ptlrpc_service::srv_at_estimate
> >>> is global for each service, so I'm thinking of moving it to per-CPU
> >>> data for two reasons:
> >>> 1) at_add(...) needs a spinlock; if we keep it on ptlrpc_service, then
> >>> it's a kind of global spin on the hot path, and we now know that any
> >>> spin on the hot path is amplified a lot on fat-core machines with 16
> >>> or 32 cores.
> >>> 2) Requests from the same client tend to be handled by the same thread
> >>> and CPU on the server, so I think it's reasonable to have a per-CPU AT
> >>> estimate, etc.
> >>> I know very little about this because I have only looked into it for a
> >>> few days, so I am hoping for your advice on AT or anything else about
> >>> the patch (it's still a rough prototype).
> >>>
> >> I think there is no problem moving the at_estimate and at_lock to a
> >> per-cpu struct. The server estimates might end up varying by cpu, but
> >> since they are collected together by the clients from the RPC reply, the
> >> clients will still continue to track the maximum estimate correctly.
> >> They might see a little more "jitter" in the service estimate time, but
> >> since they use a moving-maximum window (600 sec by default), this jitter
> >> will all get smoothed out.
> >>
> >> You will have to change ptlrpc_lprocfs_rd_timeouts to collect the
> >> server-side service estimate from the per-cpu estimates, but I don't
> >> think there's any need even here to do locking across the service
> >> threads - just "max" each of the data points across the per-cpu values.
> >> Hmm, actually a little trickiness comes in because we print the estimate
> >> history (4 data points), but with per-cpu measurements the start time
> >> (at_binstart) of the history values may vary. IOW, the history is the
> >> maximum estimate within a series of time slices (150s default), but
> >> those slices may not line up between cpus. So taking the max is not
> >> truly the right thing to do, although it might not be worth much effort
> >> to do any better.
> >>
> >>                 LCONSOLE_WARN("%s: This server is not able to keep up with "
> >> -                             "request traffic (cpu-bound).\n", svc->srv_name);
> >> +                             "request traffic (cpu-bound).\n",
> >> +                             scd->scd_service->srv_name);
> >> You could make this more fun:
> >> (cpu bound on cpu #%d).\n", ...scd->scd_cpu_id
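[Sketch referenced above - a standalone illustration of the NID-hashing idea
Eric mentions, not Lustre source. It shows how requests from the same client
could be kept on the same CPU by hashing the end-to-end peer NID and using
the result to pick the per-CPU service data; the names lnet_nid_sim_t and
pick_cpu() are invented for this sketch.]

/*
 * Standalone illustration only - not Lustre code.  Hash the end-to-end
 * peer NID so the same client always lands on the same CPU's queue,
 * while different clients spread across CPUs.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t lnet_nid_sim_t;        /* stand-in for a 64-bit NID */

/* Multiplicative hash of the NID, folded onto the number of CPUs. */
static unsigned int pick_cpu(lnet_nid_sim_t nid, unsigned int ncpus)
{
        uint64_t h = nid * 2654435761ULL;       /* Knuth's constant */

        return (unsigned int)(h % ncpus);
}

int main(void)
{
        lnet_nid_sim_t clients[] = { 0x12345678ULL, 0x12345679ULL, 0xdeadbeefULL };
        unsigned int ncpus = 16;
        int i;

        /* The same NID always maps to the same CPU. */
        for (i = 0; i < 3; i++)
                printf("client %#llx -> cpu %u\n",
                       (unsigned long long)clients[i],
                       pick_cpu(clients[i], ncpus));
        return 0;
}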
Eric,

Eric Barton wrote:
>> The only way I can find to resolve this problem is creating multiple
>> LNet networks between router and server, so both router and server can
>> have multiple peers & connections for the remote side. Actually, I think
>> it's fair for the router/server to take more credits, buffers, and CPUs on
>> the server/router than other clients do. We have two options to achieve this:
>>    a) Static configuration by the user; then we don't need to change
>> anything, but it will increase the complexity of network configuration,
>> and some users may feel confused.
>>    b) LNet can create sub-networks for the network between router and
>> server (we can make it tunable), and requests will be balanced across the
>> different sub-networks. We can make it almost transparent to the user. It
>> seems doable to me, but I haven't estimated how much effort we would need.
>>
> The more transparent, the better. If _all_ configuration could be avoided,
> then so much the better. Multiple connections to the same immediate
> physical peer at the LND level allow maximum SMP concurrency, but this is
> best detected/managed in the generic LNET code. This seems to beg for
> adding explicit connection handling to the LND API and (as we've known
> forever) would remove a lot of duplication between LNDs.
>
I think we have two options here:
1. As you said, multiple connections to the same physical peer at the LND
level; this is the ideal way but will take more effort.
2. I have a feeling that this issue could somehow be covered by channel
bonding, which should be able to support bonding several LNet networks into
one. That way we can aggregate the throughput of connections on the same
physical NI (on different CPUs) as well as on different physical NIs. The
downside I can think of is that it will take more preallocated memory
resources at the LND level for each network (i.e. preallocated TXs, etc.).

Isaac, do you have any thoughts about this?

Thanks
Liang

> However, I think all we do right now is size the work and leave it pending.
> We _can_ achieve the same effect with explicit configuration, and the most
> important use case right now is the MDS, which should be amply provisioned
> with routers where it matters.
>
>     Cheers,
>               Eric
>
>
>> Any suggestion?
>>
>> Thanks
>> Liang
>>
>>
>> Eric Barton wrote:
>>
>>> Nathan,
>>>
>>> Please talk me through these issues.
>>>
>>>     Cheers,
>>>               Eric
>>>
>>>
>>>> -----Original Message-----
>>>> From: Nathan.Rutman at Sun.COM [mailto:Nathan.Rutman at Sun.COM]
>>>> Sent: 01 May 2009 12:45 AM
>>>> To: Liang Zhen
>>>> Cc: Eric Barton; Robert Read
>>>> Subject: Re: AT and ptlrpc SMP stuff
>>>>
>>>> Liang Zhen wrote:
>>>>
>>>>
>>>>> Nathan,
>>>>>
>>>>> Yes, I don't know whether eeb has sent you the patch or not, so I have
>>>>> put it in the attachment.
>>>>>
>>>>> Basically, I move some members from ptlrpc_service to per-CPU data,
>>>>> and make service threads CPU-affine by default, in order to get rid of
>>>>> any possible global lock contention on the RPC handling path, i.e.
>>>>> ptlrpc_service::srv_lock. As you know, ptlrpc_service::srv_at_estimate
>>>>> is global for each service, so I'm thinking of moving it to per-CPU
>>>>> data for two reasons:
>>>>> 1) at_add(...) needs a spinlock; if we keep it on ptlrpc_service, then
>>>>> it's a kind of global spin on the hot path, and we now know that any
>>>>> spin on the hot path is amplified a lot on fat-core machines with 16
>>>>> or 32 cores.
>>>>> 2) Requests from the same client tend to be handled by the same thread
>>>>> and CPU on the server, so I think it's reasonable to have a per-CPU AT
>>>>> estimate, etc.
>>>>> I know very little about this because I have only looked into it for a
>>>>> few days, so I am hoping for your advice on AT or anything else about
>>>>> the patch (it's still a rough prototype).
>>>>>
>>>>>
>>>> I think there is no problem moving the at_estimate and at_lock to a
>>>> per-cpu struct. The server estimates might end up varying by cpu, but
>>>> since they are collected together by the clients from the RPC reply, the
>>>> clients will still continue to track the maximum estimate correctly.
>>>> They might see a little more "jitter" in the service estimate time, but
>>>> since they use a moving-maximum window (600 sec by default), this jitter
>>>> will all get smoothed out.
>>>>
>>>> You will have to change ptlrpc_lprocfs_rd_timeouts to collect the
>>>> server-side service estimate from the per-cpu estimates, but I don't
>>>> think there's any need even here to do locking across the service
>>>> threads - just "max" each of the data points across the per-cpu values.
>>>> Hmm, actually a little trickiness comes in because we print the estimate
>>>> history (4 data points), but with per-cpu measurements the start time
>>>> (at_binstart) of the history values may vary. IOW, the history is the
>>>> maximum estimate within a series of time slices (150s default), but
>>>> those slices may not line up between cpus. So taking the max is not
>>>> truly the right thing to do, although it might not be worth much effort
>>>> to do any better.
>>>>
>>>>                 LCONSOLE_WARN("%s: This server is not able to keep up with "
>>>> -                             "request traffic (cpu-bound).\n", svc->srv_name);
>>>> +                             "request traffic (cpu-bound).\n",
>>>> +                             scd->scd_service->srv_name);
>>>> You could make this more fun:
>>>> (cpu bound on cpu #%d).\n", ...scd->scd_cpu_id
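[A minimal standalone sketch of the "max each data point across the per-cpu
values" suggestion quoted above - not the actual Lustre patch. The names
at_hist_sim, AT_BINS_SIM and at_hist_max() are invented for illustration,
and, as Nathan notes, the per-CPU bin start times (at_binstart) may not line
up, so the result is only an approximation.]

/*
 * Standalone illustration only - not Lustre code.  Aggregate a per-CPU
 * adaptive-timeout history into one service-wide view by taking a
 * bin-wise maximum across CPUs.
 */
#include <stdio.h>

#define AT_BINS_SIM 4   /* 4 history slices, as in the discussion */

struct at_hist_sim {
        unsigned int at_hist[AT_BINS_SIM];  /* per-slice max estimate (seconds) */
};

/* Fold per-CPU histories into one by taking a bin-wise maximum. */
static void at_hist_max(const struct at_hist_sim *percpu, int ncpus,
                        struct at_hist_sim *out)
{
        int bin, cpu;

        for (bin = 0; bin < AT_BINS_SIM; bin++) {
                unsigned int max = 0;

                for (cpu = 0; cpu < ncpus; cpu++)
                        if (percpu[cpu].at_hist[bin] > max)
                                max = percpu[cpu].at_hist[bin];
                out->at_hist[bin] = max;
        }
}

int main(void)
{
        struct at_hist_sim percpu[2] = { { { 5, 7, 6, 4 } }, { { 9, 3, 6, 8 } } };
        struct at_hist_sim svc;
        int bin;

        at_hist_max(percpu, 2, &svc);
        for (bin = 0; bin < AT_BINS_SIM; bin++)
                printf("slice %d: worst estimate %us\n", bin, svc.at_hist[bin]);
        return 0;
}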
Sorry, I am a newbie here, so please bear with me. I have recently
downloaded Lustre and installed it on Ubuntu 8.10, though I have not had
time to go through the code yet. I am new to Lustre, but this seems to be a
completely different approach from filesystems like JFS or XFS or even ZFS.
You seem to divide the interface into a client and filesystem drivers. Is
there any LND support on the client side, or is it set up in the metadata
lookup or the driver? Ideally the code needs to be divided on the client
side, with a few stubs on the driver/metadata side, according to current
development needs.

Thanks,
Sujit

On Wed, May 6, 2009 at 11:54 AM, Liang Zhen <Zhen.Liang at sun.com> wrote:
> Eric,
>
> Eric Barton wrote:
>>> The only way I can find to resolve this problem is creating multiple
>>> LNet networks between router and server, so both router and server can
>>> have multiple peers & connections for the remote side. Actually, I think
>>> it's fair for the router/server to take more credits, buffers, and CPUs on
>>> the server/router than other clients do. We have two options to achieve this:
>>>    a) Static configuration by the user; then we don't need to change
>>> anything, but it will increase the complexity of network configuration,
>>> and some users may feel confused.
>>>    b) LNet can create sub-networks for the network between router and
>>> server (we can make it tunable), and requests will be balanced across the
>>> different sub-networks. We can make it almost transparent to the user. It
>>> seems doable to me, but I haven't estimated how much effort we would need.
>>>
>>
>> The more transparent, the better. If _all_ configuration could be avoided,
>> then so much the better. Multiple connections to the same immediate
>> physical peer at the LND level allow maximum SMP concurrency, but this is
>> best detected/managed in the generic LNET code. This seems to beg for
>> adding explicit connection handling to the LND API and (as we've known
>> forever) would remove a lot of duplication between LNDs.
>>
>
> I think we have two options here:
> 1. As you said, multiple connections to the same physical peer at the LND
> level; this is the ideal way but will take more effort.
> 2. I have a feeling that this issue could somehow be covered by channel
> bonding, which should be able to support bonding several LNet networks into
> one. That way we can aggregate the throughput of connections on the same
> physical NI (on different CPUs) as well as on different physical NIs. The
> downside I can think of is that it will take more preallocated memory
> resources at the LND level for each network (i.e. preallocated TXs, etc.).
>
> Isaac, do you have any thoughts about this?
>
> Thanks
> Liang
>
>
>> However, I think all we do right now is size the work and leave it pending.
>> We _can_ achieve the same effect with explicit configuration, and the most
>> important use case right now is the MDS, which should be amply provisioned
>> with routers where it matters.
>>
>>     Cheers,
>>               Eric
>>
>>
>>> Any suggestion?
>>>
>>> Thanks
>>> Liang
>>>
>>>
>>> Eric Barton wrote:
>>>
>>>> Nathan,
>>>>
>>>> Please talk me through these issues.
>>>>
>>>>     Cheers,
>>>>               Eric
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Nathan.Rutman at Sun.COM [mailto:Nathan.Rutman at Sun.COM]
>>>>> Sent: 01 May 2009 12:45 AM
>>>>> To: Liang Zhen
>>>>> Cc: Eric Barton; Robert Read
>>>>> Subject: Re: AT and ptlrpc SMP stuff
>>>>>
>>>>> Liang Zhen wrote:
>>>>>
>>>>>
>>>>>> Nathan,
>>>>>>
>>>>>> Yes, I don't know whether eeb has sent you the patch or not, so I have
>>>>>> put it in the attachment.
>>>>>>
>>>>>> Basically, I move some members from ptlrpc_service to per-CPU data,
>>>>>> and make service threads CPU-affine by default, in order to get rid of
>>>>>> any possible global lock contention on the RPC handling path, i.e.
>>>>>> ptlrpc_service::srv_lock. As you know, ptlrpc_service::srv_at_estimate
>>>>>> is global for each service, so I'm thinking of moving it to per-CPU
>>>>>> data for two reasons:
>>>>>> 1) at_add(...) needs a spinlock; if we keep it on ptlrpc_service, then
>>>>>> it's a kind of global spin on the hot path, and we now know that any
>>>>>> spin on the hot path is amplified a lot on fat-core machines with 16
>>>>>> or 32 cores.
>>>>>> 2) Requests from the same client tend to be handled by the same thread
>>>>>> and CPU on the server, so I think it's reasonable to have a per-CPU AT
>>>>>> estimate, etc.
>>>>>> I know very little about this because I have only looked into it for a
>>>>>> few days, so I am hoping for your advice on AT or anything else about
>>>>>> the patch (it's still a rough prototype).
>>>>>>
>>>>>>
>>>>> I think there is no problem moving the at_estimate and at_lock to a
>>>>> per-cpu struct. The server estimates might end up varying by cpu, but
>>>>> since they are collected together by the clients from the RPC reply, the
>>>>> clients will still continue to track the maximum estimate correctly.
>>>>> They might see a little more "jitter" in the service estimate time, but
>>>>> since they use a moving-maximum window (600 sec by default), this jitter
>>>>> will all get smoothed out.
>>>>>
>>>>> You will have to change ptlrpc_lprocfs_rd_timeouts to collect the
>>>>> server-side service estimate from the per-cpu estimates, but I don't
>>>>> think there's any need even here to do locking across the service
>>>>> threads - just "max" each of the data points across the per-cpu values.
>>>>> Hmm, actually a little trickiness comes in because we print the estimate
>>>>> history (4 data points), but with per-cpu measurements the start time
>>>>> (at_binstart) of the history values may vary. IOW, the history is the
>>>>> maximum estimate within a series of time slices (150s default), but
>>>>> those slices may not line up between cpus. So taking the max is not
>>>>> truly the right thing to do, although it might not be worth much effort
>>>>> to do any better.
>>>>>
>>>>>                 LCONSOLE_WARN("%s: This server is not able to keep up with "
>>>>> -                             "request traffic (cpu-bound).\n", svc->srv_name);
>>>>> +                             "request traffic (cpu-bound).\n",
>>>>> +                             scd->scd_service->srv_name);
>>>>> You could make this more fun:
>>>>> (cpu bound on cpu #%d).\n", ...scd->scd_cpu_id
>>>>>
>>>>
>>
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>

--
-- Sujit K M