Afternoon

I upgraded our OSS's from 1.8.3 to 1.8.4 on Saturday (due to
https://bugzilla.lustre.org/show_bug.cgi?id=22755) and suffered a
great deal of pain.

We have 30 OSS's of multiple vintages. The basic difference between
them is

 * md on the first 20 nodes
 * 3ware 9650SE ML12 on the last 10 nodes

After the upgrade to 1.8.4 we were seeing terrible throughput on the
nodes with 3ware cards (and only the nodes with 3ware cards). This was
typified by seeing the block device 100% utilised (iostat), doing about
100 r/s and 400 kB/s, with all the ost_io threads in D state (no
writes). They would stay in this state for 10 minutes and then suddenly
wake and start pushing data again. 1-2 minutes later they would lock up
again.

The OSS's were dumping stacks all over the place, crawling along and
generally making our Lustre filesystem unusable.

After trying different kernels, RAID card drivers, changing the
write-back policy on the RAID cards etc., the solution was to run

  lctl set_param obdfilter.*.writethrough_cache_enable=0
  lctl set_param obdfilter.*.read_cache_enable=0

on all the nodes with the 3ware cards.

Has anyone else seen this? I am completely baffled as to why it only
affects our nodes with 3ware cards.

These nodes were working very well under 1.8.3...

-- 
Dr Stuart Midgley
sdm900 at gmail.com
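For anyone following along at home, a minimal sketch of checking these
settings and carrying the workaround across OSS restarts, assuming the
standard Lustre 1.8 tunable interface (the conf_param form must be run
on the MGS, and "fsname-OST0000" is a placeholder for a real OST name):

  # inspect the current cache settings on an OSS
  lctl get_param obdfilter.*.writethrough_cache_enable
  lctl get_param obdfilter.*.read_cache_enable

  # set_param does not survive a reboot; to make the change
  # permanent, run conf_param on the MGS for each affected OST
  lctl conf_param fsname-OST0000.obdfilter.writethrough_cache_enable=0
  lctl conf_param fsname-OST0000.obdfilter.read_cache_enable=0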
On 13/09/2010 11:31, Stu Midgley wrote:
> After the upgrade to 1.8.4 we were seeing terrible throughput on the
> nodes with 3ware cards (and only the nodes with 3ware cards).
> [...]
> Has anyone else seen this? I am completely baffled as to why it only
> affects our nodes with 3ware cards.
>
> These nodes were working very well under 1.8.3...

We have the same problem here, but we're not on 3ware: QLA2462 HBAs
with a Xyratex F5404E 4Gb FC-SAS/SATA-II RAID, on 1.8.4. On 1.8.3 this
also occurred at startup, but it was OK afterwards.

-- 
Weill Philippe - Systems and Network Administrator
CNRS/UPMC/IPSL LATMOS (UMR 8190)
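A quick way to check whether other hardware is hitting the same stall
pattern is the iostat/ps combination described above. A rough sketch,
where /dev/sdb stands in for the OST backing device and the service
thread names are assumed to match the 1.8 ll_ost_io naming:

  # extended per-device stats every 5 seconds; look for ~100 %util
  # paired with very low r/s and kB/s throughput
  iostat -x 5 /dev/sdb

  # count ost_io threads stuck in uninterruptible sleep (D state)
  ps -eo state,comm | awk '$1 == "D" && $2 ~ /ost_io/' | wc -l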
Stu Midgley wrote:
> After the upgrade to 1.8.4 we were seeing terrible throughput on the
> nodes with 3ware cards (and only the nodes with 3ware cards). This was
> typified by seeing the block device 100% utilised (iostat), doing about
> 100 r/s and 400 kB/s, with all the ost_io threads in D state (no
> writes).
> [...]
> The OSS's were dumping stacks all over the place, crawling along and
> generally making our Lustre filesystem unusable.

Would you post a few of the stack traces? Presumably these were driven
by watchdog timeouts, but it would help to know where they were getting
stuck.

> After trying different kernels, RAID card drivers, changing the
> write-back policy on the RAID cards etc., the solution was to run
>
>   lctl set_param obdfilter.*.writethrough_cache_enable=0
>   lctl set_param obdfilter.*.read_cache_enable=0
>
> on all the nodes with the 3ware cards.
> [...]
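For anyone needing to collect those traces: the Lustre watchdog logs a
stack dump when a service thread has been inactive past its timeout,
and a dump of every kernel task can be forced via sysrq. A sketch,
assuming a standard syslog setup (log path may differ per distro):

  # watchdog dumps from stuck service threads land in syslog;
  # the trace follows the "Service thread" inactivity message
  grep -A 20 'Service thread' /var/log/messages

  # force a stack trace of every task into the kernel log (verbose!)
  echo t > /proc/sysrq-trigger
  dmesg | less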