Afternoon

I upgraded our OSS's from 1.8.3 to 1.8.4 on Saturday (due to
https://bugzilla.lustre.org/show_bug.cgi?id=22755) and suffered a
great deal of pain.

We have 30 OSS's of multiple vintages. The basic difference between
them is

 * md on the first 20 nodes
 * 3ware 9650SE ML12 on the last 10 nodes

After the upgrade to 1.8.4 we were seeing terrible throughput on the
nodes with 3ware cards (and only the nodes with 3ware cards). This was
typified by seeing the block device 100% utilised (iostat), doing about
100 r/s and 400 kB/s, with all the ost_io threads in D state (no
writes). They would stay in this state for 10 minutes and then suddenly
wake and start pushing data again. 1-2 minutes later they would lock up
again.

The OSS's were dumping stacks all over the place, crawling along and
generally making our Lustre filesystem unusable.

After trying different kernels, RAID card drivers, changing the
write-back policy on the RAID cards etc., the solution was to run

  lctl set_param obdfilter.*.writethrough_cache_enable=0
  lctl set_param obdfilter.*.read_cache_enable=0

on all the nodes with the 3ware cards.

Has anyone else seen this? I am completely baffled as to why it only
affects our nodes with 3ware cards.

These nodes were working very well under 1.8.3...

-- 
Dr Stuart Midgley
sdm900 at gmail.com
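For anyone following along at home, a minimal sketch of checking these
settings and carrying the workaround across OSS restarts, assuming the
standard Lustre 1.8 tunable interface (the conf_param form must be run
on the MGS, and "fsname-OST0000" is a placeholder for a real OST name):

  # inspect the current cache settings on an OSS
  lctl get_param obdfilter.*.writethrough_cache_enable
  lctl get_param obdfilter.*.read_cache_enable

  # set_param does not survive a reboot; to make the change
  # permanent, run conf_param on the MGS for each affected OST
  lctl conf_param fsname-OST0000.obdfilter.writethrough_cache_enable=0
  lctl conf_param fsname-OST0000.obdfilter.read_cache_enable=0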
On 13/09/2010 11:31, Stu Midgley wrote:
> After the upgrade to 1.8.4 we were seeing terrible throughput on the
> nodes with 3ware cards (and only the nodes with 3ware cards).
> [...]
> Has anyone else seen this? I am completely baffled as to why it only
> affects our nodes with 3ware cards.
>
> These nodes were working very well under 1.8.3...

We have the same problem here, but we're not on 3ware: QLA2462 HBAs
with a Xyratex F5404E 4Gb FC-SAS/SATA-II RAID, on 1.8.4. On 1.8.3 this
also occurred at startup, but it was OK afterwards.

-- 
Weill Philippe - Systems and Network Administrator
CNRS/UPMC/IPSL LATMOS (UMR 8190)
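A quick way to check whether other hardware is hitting the same stall
pattern is the iostat/ps combination described above. A rough sketch,
where /dev/sdb stands in for the OST backing device and the service
thread names are assumed to match the 1.8 ll_ost_io naming:

  # extended per-device stats every 5 seconds; look for ~100 %util
  # paired with very low r/s and kB/s throughput
  iostat -x 5 /dev/sdb

  # count ost_io threads stuck in uninterruptible sleep (D state)
  ps -eo state,comm | awk '$1 == "D" && $2 ~ /ost_io/' | wc -l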
Stu Midgley wrote:
> After the upgrade to 1.8.4 we were seeing terrible throughput on the
> nodes with 3ware cards (and only the nodes with 3ware cards). This was
> typified by seeing the block device 100% utilised (iostat), doing about
> 100 r/s and 400 kB/s, with all the ost_io threads in D state (no
> writes).
> [...]
> The OSS's were dumping stacks all over the place, crawling along and
> generally making our Lustre filesystem unusable.

Would you post a few of the stack traces? Presumably these were driven
by watchdog timeouts, but it would help to know where they were getting
stuck.

> After trying different kernels, RAID card drivers, changing the
> write-back policy on the RAID cards etc., the solution was to run
>
>   lctl set_param obdfilter.*.writethrough_cache_enable=0
>   lctl set_param obdfilter.*.read_cache_enable=0
>
> on all the nodes with the 3ware cards.
> [...]
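For anyone needing to collect those traces: the Lustre watchdog logs a
stack dump when a service thread has been inactive past its timeout,
and a dump of every kernel task can be forced via sysrq. A sketch,
assuming a standard syslog setup (log path may differ per distro):

  # watchdog dumps from stuck service threads land in syslog;
  # the trace follows the "Service thread" inactivity message
  grep -A 20 'Service thread' /var/log/messages

  # force a stack trace of every task into the kernel log (verbose!)
  echo t > /proc/sysrq-trigger
  dmesg | less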