Peter J. Braam
2006-Nov-17 10:55 UTC
[Lustre-devel] RE: [Arch] scalability study: single client _CONNECTS_ to a very large number of OSS servers
Hi Niu,

This is a good review of the scalability of connections. But there are some
questions. I have now cc'd lustre-devel to get the discussion in the open.

> -----Original Message-----
> From: arch-bounces@clusterfs.com
> [mailto:arch-bounces@clusterfs.com] On Behalf Of Niu YaWei
> Sent: Wednesday, November 15, 2006 8:29 PM
> To: arch@clusterfs.com
> Cc: beaver@clusterfs.com
> Subject: [Arch] scalability study: single client _CONNECTS_ to a very
> large number of OSS servers
>
> Review form:
>
> 1. Use case identifier: single client _CONNECTS_ to a very large number
>    of OSS servers
>
> 2. Link to architectural information: None
>
> 3. HLD available: YES
>
> 4. Patterns of basic operations:
>    a. RPCs:
>       - One OST_CONNECT RPC for each OST.
>    b. fs/obd other methods:
>       - obd_connect.
>    c. cache: None.
>    d. Lustre & Linux locks: No suspect locks.
>    e. lists, arrays, queues @runtime:
>       - obd array obd_devs: the maximum device count is 8k, so the OSC
>         count must be less than 8k; I think it's enough.

Nope - we want this to be far more scalable than 8K OSCs. Last week we heard
that Evan Felix ran with 4000 OSCs (getting a whopping 130GB/sec read from
Lustre!).

The array needs to go away. I think Nathan is already working on this for
the load simulator, btw.

Hmm, I don't see a server-side consideration of this problem. Am I missing
something?

>       - qos_add_tgt() will search and maintain the lq_oss_list; this
>         list grows as the OSS count grows.
>       - Need to search for the connection in the imp_conn_list, but this
>         list is quite small and will never grow.
>    f. startup data: None.
>
> 5. Scalable use pattern of basic operations:
>    - One client performs mount.
>    - MDS setup.

Is connect also used against the management server?

> 6. Scalability measures:
>    - The number of OST_CONNECT RPCs is N (OST count); since the RPCs are
>      sent asynchronously, it runs in O(1) time.
>    - Unless we are going to build a cluster with more than 8k OSTs, we
>      can't run out of obd_devs.
>    - The time complexity of qos_add_tgt() is O(N), and it should only
>      happen when the MDS connects to an OSS, so no need to improve it.

On the server side, isn't there scanning of the list of existing connections
to see if a connection's UUID is already in the list? Isn't that list O(N)
long? If so, is the scan O(N^2)?

Eeb - can you confirm one more time that connection setup, which is likely
to happen at this point in LNET, has no linear scans?

> 7. Experiment description and findings:
>    - No test for it.

Nathan - will the load simulator do this? I think it could even be used over
the net?

> 8. Recommendations for improvements:
>    - No recommendation on implementation improvements.

A. Kill the array (P2)
B. Fix the searching on the server (P1)

> 9. Non scalable issues encountered, not identified by this process:
>    - The qos lists are useless for the client's lov, but we have to set
>      them up since the MDS and the client use the same lov driver. This
>      needless list maintenance work will burden each client mount; we
>      should avoid it.

Hmm. But this will change in the future when the client has a full WB cache,
so let's leave them in. Does the setup scale well?

- Peter -

> _______________________________________________
> Arch mailing list
> Arch@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/arch
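To make the complexity concern concrete, here is a minimal sketch of the
pattern under discussion: a fixed-size device table with linear lookups.
The names (dev_table, dev_find, MAX_DEVS) are invented for illustration and
this is not the Lustre obd_devs code; it only shows why setting up N devices
against such a table costs O(N^2) comparisons in total, and why nothing past
the cap can be registered at all.

/*
 * Sketch only -- hypothetical names, not Lustre source.  A fixed array
 * whose lookups are linear scans: every setup that checks for an existing
 * entry walks the whole table, so N setups do O(N^2) work.
 */
#include <stdio.h>
#include <string.h>

#define MAX_DEVS 8192                   /* analogous to the 8k obd_devs cap */

struct dev_entry {
	int  in_use;
	char name[64];
};

static struct dev_entry dev_table[MAX_DEVS];
static long scanned;                    /* slots visited, for the demo */

/* O(MAX_DEVS) linear scan, done at least once per device setup */
static int dev_find(const char *name)
{
	int i;

	for (i = 0; i < MAX_DEVS; i++) {
		scanned++;
		if (dev_table[i].in_use &&
		    strcmp(dev_table[i].name, name) == 0)
			return i;
	}
	return -1;
}

int main(void)
{
	char name[64];
	int n;

	/* register 4000 "OSCs"; each registration checks for a duplicate
	 * with a full scan before taking a free slot */
	for (n = 0; n < 4000; n++) {
		snprintf(name, sizeof(name), "osc-%d", n);
		if (dev_find(name) < 0) {
			dev_table[n].in_use = 1;
			snprintf(dev_table[n].name, sizeof(dev_table[n].name),
				 "%s", name);
		}
	}
	printf("%ld slots scanned to set up 4000 devices\n", scanned);
	return 0;
}

A dynamic structure keyed by name or UUID (a hash table, for instance) would
remove both the hard cap and the quadratic scan cost.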
Niu YaWei
2006-Nov-19 20:04 UTC
[Lustre-devel] Re: [Arch] scalability study: single client _CONNECTS_ to a very large number of OSS servers
Peter J. Braam wrote:
> Hi Niu,
>
> This is a good review of the scalability of connections. But there are
> some questions. I have now cc'd lustre-devel to get the discussion in
> the open.
>
>> 4. Patterns of basic operations:
>>    e. lists, arrays, queues @runtime:
>>       - obd array obd_devs: the maximum device count is 8k, so the OSC
>>         count must be less than 8k; I think it's enough.
>
> Nope - we want this to be far more scalable than 8K OSCs. Last week we
> heard that Evan Felix ran with 4000 OSCs (getting a whopping 130GB/sec
> read from Lustre!).
>
> The array needs to go away. I think Nathan is already working on this
> for the load simulator, btw.
>
> Hmm, I don't see a server-side consideration of this problem. Am I
> missing something?

There are OSCs in the obd_devs of the MDS, so it also has this problem, but
an OSS does not have that many obd devices for now.

>> 5. Scalable use pattern of basic operations:
>>    - One client performs mount.
>>    - MDS setup.
>
> Is connect also used against the management server?

No, I think it isn't. Nathan, could you confirm it?

>> 6. Scalability measures:
>>    - The number of OST_CONNECT RPCs is N (OST count); since the RPCs
>>      are sent asynchronously, it runs in O(1) time.
>>    - Unless we are going to build a cluster with more than 8k OSTs, we
>>      can't run out of obd_devs.
>>    - The time complexity of qos_add_tgt() is O(N), and it should only
>>      happen when the MDS connects to an OSS, so no need to improve it.
>
> On the server side, isn't there scanning of the list of existing
> connections to see if a connection's UUID is already in the list? Isn't
> that list O(N) long? If so, is the scan O(N^2)?

There'll be another use case analysis for "a large number of Linux clients
_MOUNTS_ Lustre"; I think the server-side analysis is better placed there.

> Eeb - can you confirm one more time that connection setup, which is
> likely to happen at this point in LNET, has no linear scans?
>
>> 7. Experiment description and findings:
>>    - No test for it.
>
> Nathan - will the load simulator do this? I think it could even be used
> over the net?
>
>> 8. Recommendations for improvements:
>>    - No recommendation on implementation improvements.
>
> A. Kill the array (P2)
> B. Fix the searching on the server (P1)
>
>> 9. Non scalable issues encountered, not identified by this process:
>>    - The qos lists are useless for the client's lov, but we have to set
>>      them up since the MDS and the client use the same lov driver. This
>>      needless list maintenance work will burden each client mount; we
>>      should avoid it.
>
> Hmm. But this will change in the future when the client has a full WB
> cache, so let's leave them in. Does the setup scale well?

Andreas has made a proposal to optimize qos list maintenance; we'll file a
bug to fix it.

Thanks
- Niu
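For reference, here is a minimal sketch of the kind of qos list maintenance
being discussed. The names (oss_info, oss_list, qos_add_tgt_sketch) are made
up and this is not the actual Lustre qos code; it only shows the shape of the
cost: each target added at setup scans the OSS list for an existing entry
before appending, which is work a pure client pays at every mount without
ever using the resulting list for allocation.

/*
 * Hypothetical sketch: per-target list maintenance with a linear
 * duplicate scan, so adding N targets scans a list that grows with the
 * number of OSSes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct oss_info {
	char             oss_uuid[40];
	int              tgt_count;
	struct oss_info *next;
};

static struct oss_info *oss_list;

/* called once per target during lov setup */
static void qos_add_tgt_sketch(const char *oss_uuid)
{
	struct oss_info *oss;

	/* linear scan: does this OSS already have an entry? */
	for (oss = oss_list; oss != NULL; oss = oss->next) {
		if (strcmp(oss->oss_uuid, oss_uuid) == 0) {
			oss->tgt_count++;
			return;
		}
	}

	/* not found: prepend a new entry */
	oss = calloc(1, sizeof(*oss));
	snprintf(oss->oss_uuid, sizeof(oss->oss_uuid), "%s", oss_uuid);
	oss->tgt_count = 1;
	oss->next = oss_list;
	oss_list = oss;
}

int main(void)
{
	struct oss_info *oss;
	char uuid[40];
	int i, count = 0;

	/* 4000 OSTs spread across 500 OSSes */
	for (i = 0; i < 4000; i++) {
		snprintf(uuid, sizeof(uuid), "oss-%d_UUID", i % 500);
		qos_add_tgt_sketch(uuid);
	}
	for (oss = oss_list; oss != NULL; oss = oss->next)
		count++;
	printf("%d OSS entries maintained at mount time\n", count);
	return 0;
}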
Nathaniel Rutman
2006-Nov-20 10:21 UTC
[Lustre-devel] Re: [Arch] scalability study: single client _CONNECTS_ to a very large number of OSS servers
Niu YaWei wrote:
>>> 5. Scalable use pattern of basic operations:
>>>    - One client performs mount.
>>>    - MDS setup.
>>
>> Is connect also used against the management server?
>
> No, I think it isn't. Nathan, could you confirm it?

Yes, the MGCs are full clients of the MGS. They connect and can get evicted
if not heard from. But there is only a single MGC per node in the usual
cases -- i.e. multiple clients or OSTs or whatever all share the same MGC.

>>> 6. Scalability measures:
>>>    - The number of OST_CONNECT RPCs is N (OST count); since the RPCs
>>>      are sent asynchronously, it runs in O(1) time.
>>>    - Unless we are going to build a cluster with more than 8k OSTs, we
>>>      can't run out of obd_devs.

That's not true. If you mount two clients on the same node, you need twice
the obds.

>>>    - The time complexity of qos_add_tgt() is O(N), and it should only
>>>      happen when the MDS connects to an OSS, so no need to improve it.
>>
>> On the server side, isn't there scanning of the list of existing
>> connections to see if a connection's UUID is already in the list? Isn't
>> that list O(N) long? If so, is the scan O(N^2)?
>
> There'll be another use case analysis for "a large number of Linux
> clients _MOUNTS_ Lustre"; I think the server-side analysis is better
> placed there.
>
>> Eeb - can you confirm one more time that connection setup, which is
>> likely to happen at this point in LNET, has no linear scans?

class_add_uuid() is O(N^2) also - see bugzilla 10345.

>>> 7. Experiment description and findings:
>>>    - No test for it.
>>
>> Nathan - will the load simulator do this? I think it could even be used
>> over the net?

I don't see any reason why not, except that it assumes the local NID for the
OST. I could just add that as a parameter.
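For illustration, one direction for "fix the searching on the server" is to
replace a linear UUID list with a hash, so the duplicate check is roughly
O(1) and N registrations cost O(N) overall instead of O(N^2). This is only a
sketch with invented names (uuid_table, uuid_add, UUID_HASH_BUCKETS), not the
actual class_add_uuid() code or the bug 10345 fix.

/*
 * Sketch only: hash-bucketed UUID registration with an O(1) amortized
 * duplicate check, in contrast to scanning one long linear list.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define UUID_HASH_BUCKETS 1024

struct uuid_entry {
	char               uuid[40];
	struct uuid_entry *next;
};

static struct uuid_entry *uuid_table[UUID_HASH_BUCKETS];

static unsigned int uuid_bucket(const char *uuid)
{
	unsigned int h = 5381;

	while (*uuid)
		h = h * 33 + (unsigned char)*uuid++;
	return h % UUID_HASH_BUCKETS;
}

/* returns 1 if the UUID was already registered, 0 if it was added */
static int uuid_add(const char *uuid)
{
	unsigned int b = uuid_bucket(uuid);
	struct uuid_entry *e;

	for (e = uuid_table[b]; e != NULL; e = e->next)	/* short chain */
		if (strcmp(e->uuid, uuid) == 0)
			return 1;

	e = calloc(1, sizeof(*e));
	snprintf(e->uuid, sizeof(e->uuid), "%s", uuid);
	e->next = uuid_table[b];
	uuid_table[b] = e;
	return 0;
}

int main(void)
{
	char uuid[40];
	int i, dups = 0;

	for (i = 0; i < 4000; i++) {
		snprintf(uuid, sizeof(uuid), "OST%04x_UUID", i);
		dups += uuid_add(uuid);
	}
	printf("4000 UUIDs registered, %d duplicates found\n", dups);
	return 0;
}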