On kernel 3.8.13:

Using two equal-performance SATAII HDDs, formatted as btrfs raid1 for
both data and metadata:

The second disk appears to suffer about 8x the read activity of the
first disk. This causes the second disk to quickly get maxed out whilst
the first disk remains almost idle.

Total writes to the two disks are equal.

This is noticeable, for example, when running "emerge --sync" or
running compiles on Gentoo.

Is this a known feature/problem, or is it worth looking into further?

Regards,
Martin
On Fri, Jun 28, 2013 at 02:59:45PM +0100, Martin wrote:
> On kernel 3.8.13:
>
> Using two equal-performance SATAII HDDs, formatted as btrfs raid1 for
> both data and metadata:
>
> The second disk appears to suffer about 8x the read activity of the
> first disk. This causes the second disk to quickly get maxed out
> whilst the first disk remains almost idle.
>
> Total writes to the two disks are equal.
>
> This is noticeable, for example, when running "emerge --sync" or
> running compiles on Gentoo.
>
> Is this a known feature/problem, or is it worth looking into further?

So we balance based on pids; if you have one process that's doing a lot
of work it will tend to be stuck on one disk, which is why you are
seeing that kind of imbalance.

Thanks,

Josef
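[For reference, a minimal user-space sketch of the selection Josef
describes. The real code lives around find_live_mirror() in
fs/btrfs/volumes.c; this simplified version only illustrates the
arithmetic and is not the kernel source.]

/* Simplified illustration of PID-based read-mirror selection:
 * every read issued by a given process lands on the same copy,
 * because the choice depends only on that process's PID
 * (roughly current->pid % map->num_stripes in the kernel). */
#include <stdio.h>
#include <unistd.h>

static int pick_read_mirror(pid_t pid, int num_stripes)
{
	return pid % num_stripes;
}

int main(void)
{
	/* With two copies, all even PIDs read from copy 0 and all
	 * odd PIDs read from copy 1. */
	printf("this process would read from copy %d\n",
	       pick_read_mirror(getpid(), 2));
	return 0;
}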
On Fri, Jun 28, 2013 at 11:34:18AM -0400, Josef Bacik wrote:
> On Fri, Jun 28, 2013 at 02:59:45PM +0100, Martin wrote:
> > On kernel 3.8.13:
> >
> > Using two equal-performance SATAII HDDs, formatted as btrfs raid1
> > for both data and metadata:
> >
> > The second disk appears to suffer about 8x the read activity of the
> > first disk. This causes the second disk to quickly get maxed out
> > whilst the first disk remains almost idle.
> >
> > Total writes to the two disks are equal.
> >
> > This is noticeable, for example, when running "emerge --sync" or
> > running compiles on Gentoo.
> >
> > Is this a known feature/problem, or is it worth looking into further?
>
> So we balance based on pids; if you have one process that's doing a
> lot of work it will tend to be stuck on one disk, which is why you are
> seeing that kind of imbalance.

   The other scenario is that if each compilation step executes an even
number of processes, the heavy-duty file-reading parts will always land
on the same PID parity. If each tool has, say, a small wrapper around
it, then the wrappers will all run as (say) odd PIDs, and the tools
themselves will run as even PIDs...

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Startle, startle, little twink. How I wonder what you think. ---
Hugo Mills posted on Fri, 28 Jun 2013 16:39:10 +0100 as excerpted:

> On Fri, Jun 28, 2013 at 11:34:18AM -0400, Josef Bacik wrote:
>> On Fri, Jun 28, 2013 at 02:59:45PM +0100, Martin wrote:
>>> On kernel 3.8.13:
>>>
>>> Using two equal-performance SATAII HDDs, formatted as btrfs raid1
>>> for both data and metadata:
>>>
>>> The second disk appears to suffer about 8x the read activity of the
>>> first disk. This causes the second disk to quickly get maxed out
>>> whilst the first disk remains almost idle.
>>>
>>> Total writes to the two disks are equal.
>>>
>>> This is noticeable, for example, when running "emerge --sync" or
>>> running compiles on Gentoo.
>>
>> So we balance based on pids; if you have one process that's doing a
>> lot of work it will tend to be stuck on one disk, which is why you
>> are seeing that kind of imbalance.
>
> The other scenario is that if each compilation step executes an even
> number of processes, the heavy-duty file-reading parts will always
> land on the same PID parity. If each tool has, say, a small wrapper
> around it, then the wrappers will all run as (say) odd PIDs, and the
> tools themselves will run as even PIDs...

Ouch and double-ouch!

I'm a Gentooer too, but I guess I haven't seen the issue, probably
because I switched to SSD at the same time I switched to btrfs (in dual
SSD raid1 mode for both data and metadata), and the performance
difference between my old reiserfs on spinning rust and my new btrfs on
SSD is enough that it's many times faster in any case, such that I
simply haven't noticed this issue.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
On 28/06/13 16:39, Hugo Mills wrote:
> On Fri, Jun 28, 2013 at 11:34:18AM -0400, Josef Bacik wrote:
>> On Fri, Jun 28, 2013 at 02:59:45PM +0100, Martin wrote:
>>> [...]
>>> Is this a known feature/problem, or is it worth looking into further?
>>
>> So we balance based on pids; if you have one process that's doing a
>> lot of work it will tend to be stuck on one disk, which is why you
>> are seeing that kind of imbalance.
>
> The other scenario is that if each compilation step executes an even
> number of processes, the heavy-duty file-reading parts will always
> land on the same PID parity. If each tool has, say, a small wrapper
> around it, then the wrappers will all run as (say) odd PIDs, and the
> tools themselves will run as even PIDs...

Ouch! Good find...

To test with just:

for a in {1..4} ; do ( dd if=/dev/zero of=$a bs=10M count=100 & ) ; done

ps shows:

martin    9776  9.6  0.1  18740 10904 pts/2  D  17:15  0:00 dd
martin    9778  8.5  0.1  18740 10904 pts/2  D  17:15  0:00 dd
martin    9780  8.5  0.1  18740 10904 pts/2  D  17:15  0:00 dd
martin    9782  9.5  0.1  18740 10904 pts/2  D  17:15  0:00 dd

More to the story, from atop, looks to be:

One disk maxed out with three dd processes on one CPU core, the second
disk utilised by one dd on the second CPU core...

Looks like using a simple round-robin is pathological for an even
number of disks, or indeed if you have a mix of disks with different
capabilities. File access will pile up on the slowest of the disks, or
on whatever HDD coincides with the process (pid) creation multiple...

So... an immediate work-around is to go all SSD, or to work in odd
multiples of HDDs?!

Rather than that: are any easy tweaks available, please?

Thanks,
Martin
On 06/28/2013 09:25 AM, Martin wrote:
> On 28/06/13 16:39, Hugo Mills wrote:
>> On Fri, Jun 28, 2013 at 11:34:18AM -0400, Josef Bacik wrote:
>>> [...]
>>> So we balance based on pids; if you have one process that's doing a
>>> lot of work it will tend to be stuck on one disk, which is why you
>>> are seeing that kind of imbalance.
>> The other scenario is that if each compilation step executes an even
>> number of processes, the heavy-duty file-reading parts will always
>> land on the same PID parity. If each tool has, say, a small wrapper
>> around it, then the wrappers will all run as (say) odd PIDs, and the
>> tools themselves will run as even PIDs...
> Ouch! Good find...
>
> [...]
>
> Looks like using a simple round-robin is pathological for an even
> number of disks, or indeed if you have a mix of disks with different
> capabilities. File access will pile up on the slowest of the disks,
> or on whatever HDD coincides with the process (pid) creation
> multiple...
>
> So... an immediate work-around is to go all SSD, or to work in odd
> multiples of HDDs?!
>
> Rather than that: are any easy tweaks available, please?

Interesting discussion. I just put up Gkrellm here to look at this
issue, and what I am seeing is perhaps disturbing. I have my root file
system as RAID 1 on two drives, /dev/sda and /dev/sdb. I am seeing
continual read and write activity on /dev/sdb, but nothing at all on
/dev/sda. I am sure it will eventually do a big write on /dev/sda to
sync, but it appears to be essentially using one drive in normal
routine.

All my other filesystems, /usr, /var, /opt, are RAID 1 across five
drives. In this case all drives are actively in use ... except the
fifth drive. I actually observed a long flow of continual reads and
writes, very balanced across the first four drives in this set, and
then, like a big burp, a huge write on the fifth drive. But absolutely
no reads from the fifth drive so far. Very interesting behaviour.

These are all SATA NCQ-configured drives. The first pair are notebook
drives; the five-drive set are all Seagate 2.5" enterprise-level
drives.
- George
On Fri, Jun 28, 2013 at 09:55:31AM -0700, George Mitchell wrote:
> On 06/28/2013 09:25 AM, Martin wrote:
> > [...]
>
> Interesting discussion. I just put up Gkrellm here to look at this
> issue, and what I am seeing is perhaps disturbing. I have my root file
> system as RAID 1 on two drives, /dev/sda and /dev/sdb. I am seeing
> continual read and write activity on /dev/sdb, but nothing at all on
> /dev/sda. I am sure it will eventually do a big write on /dev/sda to
> sync, but it appears to be essentially using one drive in normal
> routine.
>
> All my other filesystems, /usr, /var, /opt, are RAID 1 across five
> drives. In this case all drives are actively in use ... except the
> fifth drive. I actually observed a long flow of continual reads and
> writes, very balanced across the first four drives in this set, and
> then, like a big burp, a huge write on the fifth drive. But absolutely
> no reads from the fifth drive so far. Very interesting behaviour.
> These are all SATA NCQ-configured drives. The first pair are notebook
> drives; the five-drive set are all Seagate 2.5" enterprise-level
> drives.
>
> - George

Well that is interesting; writes should be relatively balanced across
all drives. Granted, we try to coalesce all writes to one drive, flush
those out, and go on to the next drive, but you shouldn't be seeing the
kind of activity you are currently seeing. I will take a look at it
next week and see what's going on.

As for reads, we could definitely be much smarter. I would like to do
something like this (I'm spelling it out in case somebody wants to do
it before I get to it):

1) Keep a per-device counter of how many read requests have been done.
2) Make the PID-based decision, and then check whether the device we've
   chosen has many more read requests than the other device. If so,
   choose the other device.
   -> EXCEPTION: if we are doing a big sequential read we want to stay
      on one disk, since the head will already be in place on the disk
      we've been pegging, so ignore the logic for this. This means
      saving the last sector we read from and comparing it to the next
      sector we are going to read from; MD does this.
   -> EXCEPTION to the EXCEPTION: if the devices are SSDs then don't
      bother doing this work; always maintain evenness amongst the
      devices.

If somebody were going to do this, they'd just have to find the places
where we call find_live_mirror() in volumes.c and adjust the logic to
hand find_live_mirror() the entire map, then go through the devices and
make the decision. You'd still need to keep the device-replace logic.

Thanks,

Josef
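[A rough user-space sketch of the heuristic Josef outlines above. The
struct, field names and the threshold are hypothetical illustrations;
the real change would live around find_live_mirror() in
fs/btrfs/volumes.c.]

/* Hypothetical sketch of the proposed read-mirror selection for a
 * two-copy RAID1 profile.  Illustrative only. */
#include <stdbool.h>
#include <stdint.h>

struct mirror {
	uint64_t reads_issued;     /* item 1: per-device read counter */
	uint64_t next_seq_sector;  /* sector expected if the previous
				    * read on this device continues
				    * sequentially */
	bool	 is_ssd;
};

#define READ_IMBALANCE_THRESHOLD 64   /* arbitrary example value */

static int pick_read_mirror(const struct mirror *m, int num, int pid,
			    uint64_t sector)
{
	int choice = pid % num;            /* existing PID-based decision */
	int other  = (choice + 1) % num;   /* the second copy */

	/* EXCEPTION: a sequential read continuing on a rotating disk
	 * stays where the head already is (skipped for SSDs). */
	if (!m[choice].is_ssd && sector == m[choice].next_seq_sector)
		return choice;

	/* Item 2: if the chosen device has done many more reads than
	 * the other one, use the other one instead. */
	if (m[choice].reads_issued >
	    m[other].reads_issued + READ_IMBALANCE_THRESHOLD)
		return other;

	return choice;
}

[The caller would be expected to bump reads_issued and update
next_seq_sector for whichever device is returned.]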
On 28/06/13 18:04, Josef Bacik wrote:
> On Fri, Jun 28, 2013 at 09:55:31AM -0700, George Mitchell wrote:
>> [...] I actually observed a long flow of continual reads and writes,
>> very balanced across the first four drives in this set, and then,
>> like a big burp, a huge write on the fifth drive. But absolutely no
>> reads from the fifth drive so far.
>
> Well that is interesting; writes should be relatively balanced across
> all drives. Granted, we try to coalesce all writes to one drive, flush
> those out, and go on to the next drive, but you shouldn't be seeing
> the kind of activity you are currently seeing. I will take a look at
> it next week and see what's going on.
>
> As for reads, we could definitely be much smarter. I would like to do
> something like this (I'm spelling it out in case somebody wants to do
> it before I get to it):
>
> 1) Keep a per-device counter of how many read requests have been done.
> 2) Make the PID-based decision, and then check whether the device
>    we've chosen has many more read requests than the other device.
>    If so, choose the other device.
>    -> EXCEPTION: if we are doing a big sequential read we want to
>       stay on one disk, since the head will already be in place on
>       the disk we've been pegging, so ignore the logic for this. This
>       means saving the last sector we read from and comparing it to
>       the next sector we are going to read from; MD does this.
>    -> EXCEPTION to the EXCEPTION: if the devices are SSDs then don't
>       bother doing this work; always maintain evenness amongst the
>       devices.
>
> If somebody were going to do this, they'd just have to find the
> places where we call find_live_mirror() in volumes.c and adjust the
> logic to hand find_live_mirror() the entire map, then go through the
> devices and make the decision. You'd still need to keep the
> device-replace logic.

Mmmm... I'm not sure trying to balance historical read/write counts is
the way to go... What happens for the use case of an SSD paired up with
a HDD? (For example, an SSD and a similarly sized Raptor or enterprise
SCSI disk?) Or even just a JBOD mishmash of different speeds?

Rather than trying to balance IO counts, can a realtime utilisation
check be made so as to go for the least busy device? That could be
biased secondly towards balancing IO counts if some 'non-performance'
flag/option is set/wanted by the user. Otherwise, go firstly for
whatever is recognised to be the fastest or least busy...

Good find and good note! And thanks greatly for so quickly picking this
up.

Thanks,
Martin
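[A rough illustration of the "least busy" idea, under the assumption
that a per-device count of in-flight requests is available to the
chooser; the names here are hypothetical, not existing btrfs code.]

/* Hypothetical "least busy" mirror selection: prefer the device with
 * the fewest requests currently in flight, using the existing PID
 * choice only as a starting point / tie-breaker. */
#include <stdint.h>

struct mirror_stat {
	uint32_t inflight;   /* requests currently queued on the device */
};

static int pick_least_busy(const struct mirror_stat *m, int num, int pid)
{
	int best = pid % num;   /* tie-breaker: the existing PID choice */
	int i;

	for (i = 0; i < num; i++)
		if (m[i].inflight < m[best].inflight)
			best = i;
	return best;
}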
On Sat, 29 Jun 2013, Martin <m_btrfs@ml1.co.uk> wrote:
> Mmmm... I'm not sure trying to balance historical read/write counts is
> the way to go... What happens for the use case of an SSD paired up
> with a HDD? (For example, an SSD and a similarly sized Raptor or
> enterprise SCSI disk?) Or even just a JBOD mishmash of different
> speeds?
>
> Rather than trying to balance IO counts, can a realtime utilisation
> check be made so as to go for the least busy device?

It would also be nice to be able to tune this. For example, I've got a
RAID-1 array that's mounted noatime, hardly ever written, and accessed
via NFS on 100baseT. It would be nice if one disk could be spun down
for most of the time and save 7W of system power. Something like the
--write-mostly option of mdadm would be good here.

Also, it should be possible for a RAID-1 array to allow faster reads
for a single process reading a single file if the file in question is
fragmented.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
On 29/06/13 10:41, Russell Coker wrote:
> On Sat, 29 Jun 2013, Martin wrote:
>> Mmmm... I'm not sure trying to balance historical read/write counts
>> is the way to go... What happens for the use case of an SSD paired
>> up with a HDD? (For example, an SSD and a similarly sized Raptor or
>> enterprise SCSI disk?) Or even just a JBOD mishmash of different
>> speeds?
>>
>> Rather than trying to balance IO counts, can a realtime utilisation
>> check be made so as to go for the least busy device?
>
> It would also be nice to be able to tune this. For example, I've got
> a RAID-1 array that's mounted noatime, hardly ever written, and
> accessed via NFS on 100baseT. It would be nice if one disk could be
> spun down for most of the time and save 7W of system power. Something
> like the --write-mostly option of mdadm would be good here.

For that case, a "--read-mostly" would be more apt ;-)

Hence, add a check to preferentially use the last disk used if all are
idle?

> Also, it should be possible for a RAID-1 array to allow faster reads
> for a single process reading a single file if the file in question is
> fragmented.

That sounds good, but complicated: the fragments would have to be
gathered and sorted into groups per disk... Or is something like that
already done by the block device elevator for HDDs? Also, is head-seek
optimisation turned off for SSD accesses?

(This is sounding like a lot more than just swapping
"current->pid % map->num_stripes" for a
"pseudorandom_hash(current->pid) % map->num_stripes" ... ;-) )

Is there any readily accessible present state, such as disk activity,
queue length, or access latency, available for the btrfs code to read?
I suspect a good first guess to cover many conditions would be to
'simply' choose whichever device is powered up and has the lowest
current latency, or, if idle, the lowest historical latency...

Regards,
Martin
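[For what it's worth, a tiny sketch of the hashed-PID variant mentioned
above. The hash shown is just an illustrative multiplicative hash; the
kernel's hash_32() in <linux/hash.h> does essentially the same thing.
Note that taking the high bits before the modulo matters: with two
copies, a plain (pid * odd_constant) % 2 would still preserve PID
parity and change nothing.]

/* Illustrative only: spread processes across mirrors by hashing the
 * PID instead of using it directly, so that a fixed spawning stride
 * (e.g. PIDs always two apart) no longer maps every reader to the
 * same device. */
#include <stdint.h>

static inline uint32_t pseudorandom_hash(uint32_t pid)
{
	/* 32-bit multiplicative (Knuth-style) hash. */
	return pid * 2654435761u;
}

static int pick_read_mirror_hashed(uint32_t pid, int num_stripes)
{
	/* Use the well-mixed high bits, not the low bits. */
	return (pseudorandom_hash(pid) >> 16) % num_stripes;
}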