I have an H8DME-2 motherboard with a pair of AOC-SAT2-MV8 SATA controller cards in a 16-disk Supermicro chassis. I'm running OpenSolaris 2008.11, and the machine performs very well unless I start to copy a large amount of data to the ZFS (software RAID) array that's on the Supermicro SATA controllers. If I do, the machine inevitably reboots. What can I do to troubleshoot? The motherboard BIOS and the SATA card firmware are fully updated, I'm running the latest stable OpenSolaris, and I see nothing amiss in the system logs when this happens. I've enabled savecore and debug-level syslog, but am getting no indication from Solaris as to what's wrong. Interestingly, I can push the same amount of data to the mirrored boot disks, which are on the board's built-in nVidia SATA controller, without issue. The vdev I'm pushing to is a 5-disk raidz2 with 2 hot spares. Help! :)
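For reference, here's roughly how I enabled savecore and debug-level syslog - from memory, so treat it as a sketch rather than my exact commands (the debug log path is just one I picked):

# make sure crash dumps get saved on reboot, and check the dump setup
dumpadm -y
dumpadm

# send debug-and-above messages to a file (the separator in syslog.conf must be a tab)
echo '*.debug\t/var/adm/debug.log' >> /etc/syslog.conf
touch /var/adm/debug.log
svcadm refresh svc:/system/system-log:default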
The copy operation will make all the disks start seeking at the same time and will make your CPU activity jump to a significant percentage to compute the ZFS checksums and RAIDZ parity. I think you could be overloading your PSU because of the sudden increase in power consumption... However, if you are *not* using SATA staggered spin-up, the above theory is unlikely, because spinning up consumes much more power than seeking. So, in a sense, a successful boot proves your PSU is powerful enough. Try reproducing the problem by copying data to a smaller number of disks. You tried 2 and 16. Try 8. -marc
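P.S. A quick, non-destructive way to approximate that load is to read from several raw disks at once while watching iostat - a sketch, with made-up device names you would need to substitute:

# stream reads off four disks in parallel (reads only, so safe)
for d in c5t2d0 c5t3d0 c5t4d0 c5t5d0; do
  dd if=/dev/rdsk/${d}p0 of=/dev/null bs=1024k count=4096 &
done
iostat -xn 5   # watch per-disk throughput while the dd's run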
I'm working on testing this some more by doing a savecore -L right after I start the copy.

BTW, I'm copying to a raidz2 of only 5 disks, not 16 (the chassis supports 16, but isn't fully populated).

So far as I know, there is no spin-up happening - these are not RAID controllers, just dumb SATA JBOD controllers, so I don't think they control drive spin in any particular way. Correct me if I'm wrong, of course.

On Wed, Mar 11, 2009 at 11:23 AM, Marc Bevand <m.bevand@gmail.com> wrote:
> Try reproducing the problem by copying data to a smaller number of disks.
> You tried 2 and 16. Try 8.
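The live dump I'm taking is essentially this (sketch):

mkdir -p /var/crash/host
savecore -Lv /var/crash/host   # snapshot the running system via the dedicated dump device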
I'm attaching a screenshot of the console just before reboot. The dump doesn't seem to be working, or savecore isn't working.

On Wed, Mar 11, 2009 at 11:33 AM, Blake <blake.irvin@gmail.com> wrote:
> I'm working on testing this some more by doing a savecore -L right
> after I start the copy.

[Attachment: IMG_0146.JPG - console screenshot, 647 KB]
Remco Lengers
2009-Mar-11 19:08 UTC
[zfs-discuss] reboot when copying large amounts of data
Something is not right in the IO space. The messages talk about:

vendor ID = 11AB
  0x11AB  Marvell Semiconductor

TMC Research
  Vendor Id: 0x1030
  Short Name: TMC

Does "fmdump -eV" give any clue when the box comes back up?

..Remco

Blake wrote:
> I'm attaching a screenshot of the console just before reboot. The
> dump doesn't seem to be working, or savecore isn't working.
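A couple of other FMA commands that might show something (off the top of my head):

fmdump -eV | tail -100   # most recent error telemetry, verbose
fmadm faulty             # any faults the diagnosis engine has already called
fmstat                   # per-module FMA activity counters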
fmdump is not helping much:

root@host:~# fmdump -eV
TIME                 CLASS
fmdump: /var/fm/fmd/errlog is empty

Comparing that screenshot to the output of cfgadm is interesting - it looks like the controller(s):

root@host:~# cfgadm -v
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
sata4/0::dsk/c4t0d0            connected    configured   ok         Mod: ST3250310NS FRev: SN06 SN: 9SF06CZZ
unavailable  disk         n        /devices/pci@0,0/pci15d9,1611@5:0
sata4/1::dsk/c4t1d0            connected    configured   ok         Mod: ST3250310NS FRev: SN06 SN: 9SF06BC8
unavailable  disk         n        /devices/pci@0,0/pci15d9,1611@5:1
sata5/0                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci15d9,1611@5,1:0
sata5/1::dsk/c7t1d0            connected    configured   ok         Mod: WDC WD10EACS-00D6B0 FRev: 01.01A01 SN: WD-WCAU40244615
unavailable  disk         n        /devices/pci@0,0/pci15d9,1611@5,1:1
sata6/0                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci15d9,1611@5,2:0
sata6/1                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci15d9,1611@5,2:1
sata7/0                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:0
sata7/1                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:1
sata7/2::dsk/c5t2d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT0376631
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:2
sata7/3::dsk/c5t3d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT0350798
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:3
sata7/4::dsk/c5t4d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT0403574
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:4
sata7/5::dsk/c5t5d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT0312592
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:5
sata7/6::dsk/c5t6d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT0399779
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:6
sata7/7::dsk/c5t7d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT0441660
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4:7
sata8/0::dsk/c6t0d0            connected    configured   ok         Mod: WDC WD7500AYYS-01RCA0 FRev: 30.04G30 SN: WD-WCAPT1000344
unavailable  disk         n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:0
sata8/1                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:1
sata8/2                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:2
sata8/3                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:3
sata8/4                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:4
sata8/5                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:5
sata8/6                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:6
sata8/7                        empty        unconfigured ok
unavailable  sata-port    n        /devices/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:7

On Wed, Mar 11, 2009 at 2:40 PM, Blake <blake.irvin@gmail.com> wrote:
> I'm attaching a screenshot of the console just before reboot. The
> dump doesn't seem to be working, or savecore isn't working.
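To map those vendor IDs back to actual devices, grepping the PCI device tree seems to work - a sketch:

# vendor-id properties show up as hex in the PROM device tree
prtconf -pv | grep -i 'vendor-id'
# or look for the Marvell ID specifically
prtconf -pv | grep -i 11ab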
I think that TMC Research is the company that designed the Supermicro-branded controller card that has the Marvell SATA controller chip on it. Googling around, I see connections between Supermicro and TMC.

This is the card:

http://www.supermicro.com/products/accessories/addon/AOC-SAT2-MV8.cfm

On Wed, Mar 11, 2009 at 3:08 PM, Remco Lengers <remco@lengers.com> wrote:
> Does "fmdump -eV" give any clue when the box comes back up?
Could the problem be related to this bug?

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6793353

I'm testing setting the maximum payload size as a workaround, as noted in the bug notes.

On Wed, Mar 11, 2009 at 3:14 PM, Blake <blake.irvin@gmail.com> wrote:
> I think that TMC Research is the company that designed the
> Supermicro-branded controller card that has the Marvell SATA
> controller chip on it.
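I don't have the exact line in front of me, but the mechanics are adding the tunable from the bug notes to /etc/system and rebooting. Something of this shape - the tunable name below is my assumption, so check the bug report before using it:

# /etc/system -- cap the PCIe maximum payload size (tunable name assumed)
set pcie:pcie_max_mps = 0
# reboot afterwards for the change to take effect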
Remco Lengers
2009-Mar-11 19:41 UTC
[zfs-discuss] reboot when copying large amounts of data
Looks worth a go. Otherwise: if the boot disk is also off that controller, it may be too hosed to write anything to the boot disk, hence FMA doesn't see any issue when it comes up.

Possible further actions:

- Upgrade the controller firmware to the highest or a known working level.
- Upgrade the driver or OS level.
- Try another controller (maybe it's broken and barfs under stress?).
- Analyze the crash dump (if any is saved).
- It may be a known Solaris or driver bug that somebody has heard of before.

hth,

..Remco

Blake wrote:
> Could the problem be related to this bug?
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6793353
>
> I'm testing setting the maximum payload size as a workaround, as noted
> in the bug notes.
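For the crash-dump analysis step, the usual first pass with mdb looks like this (sketch):

cd /var/crash/host
mdb unix.0 vmcore.0
> ::status      # panic string and dump summary
> ::msgbuf      # kernel messages leading up to the panic
> ::stack       # stack of the panicking thread
> $q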
Any chance this could be the motherboard? I suspect the controller. The boot disks are on the built-in nVidia controller.

On Wed, Mar 11, 2009 at 3:41 PM, Remco Lengers <remco@lengers.com> wrote:
> - Upgrade the controller firmware to the highest or a known working level.

I think I have the latest controller firmware.

> - Upgrade the driver or OS level.

I'm going to try to go from 101b to 108, or whatever the current dev release is.

> - Try another controller (maybe it's broken and barfs under stress?).

In the works.

> - Analyze the crash dump (if any is saved).

The crash dump is not saving properly.

> - It may be a known Solaris or driver bug that somebody has heard of before.

Any takers on this? :)

> hth,

Thanks!
Richard Elling
2009-Mar-11 20:58 UTC
[zfs-discuss] reboot when copying large amounts of data
Blake wrote:
> I'm working on testing this some more by doing a savecore -L right
> after I start the copy.

savecore -L is not what you want.

By default, for OpenSolaris, savecore on boot is disabled. But the core will have been dumped into the dump slice, which is not used for swap. So you should be able to run savecore at a later time to collect the core from the last dump.
-- richard
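In other words, after the box comes back up, something like this (sketch):

savecore -v /var/crash/host    # pull the core off the dump device
savecore -vd /var/crash/host   # -d disregards the dump-valid flag if savecore thinks the header is stale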
I guess I didn't make it clear that I had already tried using savecore to retrieve the core from the dump device.

I added a larger zvol for dump, to make sure that I wasn't running out of space on the dump device:

root@host:~# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/bigdump (dedicated)
Savecore directory: /var/crash/host
  Savecore enabled: yes

I was using the -L option only to try to get some idea of why the system load was climbing to 1 during a simple file copy.

On Wed, Mar 11, 2009 at 4:58 PM, Richard Elling <richard.elling@gmail.com> wrote:
> savecore -L is not what you want.
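For the record, the larger dump zvol was set up roughly like this (the size is from memory):

zfs create -V 16g rpool/bigdump
dumpadm -d /dev/zvol/dsk/rpool/bigdump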
Maidak Alexander J
2009-Mar-12 04:31 UTC
[zfs-discuss] reboot when copying large amounts of data
If you're having issues with a disk controller or disk IO driver, it's highly likely that a savecore to disk after the panic will fail. I'm not sure how to work around this - maybe a dedicated dump device on a controller that uses a different driver than the one you're having issues with?

-----Original Message-----
From: Blake
Sent: Wednesday, March 11, 2009 4:45 PM
Subject: Re: [zfs-discuss] reboot when copying large amounts of data

> I guess I didn't make it clear that I had already tried using savecore
> to retrieve the core from the dump device.
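e.g. pointing dumpadm at a slice behind the other driver - the device name here is hypothetical:

dumpadm -d /dev/dsk/c4t0d0s1   # a dump slice on a disk attached to the other controller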
My dump device is already on a different controller - the motherboard's built-in nVidia SATA controller.

The raidz2 vdev is the one I'm having trouble with (copying the same files to the mirrored rpool on the nVidia controller works nicely). I do notice that, when using cp to copy the files to the raidz2 pool, load on the machine climbs steadily until the crash, and one proc core pegs at 100%.

Frustrating, yes.

On Thu, Mar 12, 2009 at 12:31 AM, Maidak Alexander J <MaidakAlexanderJ@johndeere.com> wrote:
> If you're having issues with a disk controller or disk IO driver, it's
> highly likely that a savecore to disk after the panic will fail.
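What I'm watching while the copy runs, roughly:

prstat -a 5   # per-process and per-user CPU
mpstat 5      # shows one core pegged while the others idle
uptime        # load average climbing toward the reboot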
Nathan Kroenert
2009-Mar-12 05:55 UTC
[zfs-discuss] reboot when copying large amounts of data
Hm - crashes, or hangs? Moreover - how do you know a CPU is pegged?

Seems like we could do a little more discovery on what the actual problem here is, as I can read it about 4 different ways.

By this last piece of information, I'm guessing the system does not crash, but goes really really slow??

Crash == panic == we see a stack dump on the console and try to take a dump.
Hang == nothing works == no response -> might be worth looking at mdb -K, or booting with a -k on the boot line.

So - are we crashing, hanging, or something different?

It might simply be that you are eating up all your memory, and your physical backing storage is taking a while to catch up...?

Nathan.

Blake wrote:
> My dump device is already on a different controller - the motherboard's
> built-in nVidia SATA controller.
>
> The raidz2 vdev is the one I'm having trouble with (copying the same
> files to the mirrored rpool on the nVidia controller works nicely). I
> do notice that, when using cp to copy the files to the raidz2 pool,
> load on the machine climbs steadily until the crash, and one proc core
> pegs at 100%.
I start the cp and then, with prstat -a, watch the CPU load for the cp process climb to 25% on a 4-core machine.

Load, measured for example with 'uptime', climbs steadily until the reboot.

Note that the machine does not dump properly, panic, or hang - rather, it reboots.

I attached a screenshot earlier in this thread of the little bit of error message I could see on the console. The machine is trying to dump to the dump zvol, but fails to do so. Only sometimes do I see an error on the machine's local console - most times, it simply reboots.

On Thu, Mar 12, 2009 at 1:55 AM, Nathan Kroenert <Nathan.Kroenert@sun.com> wrote:
> Crashes, or hangs? Moreover - how do you know a CPU is pegged?
Nathan Kroenert
2009-Mar-12 06:18 UTC
[zfs-discuss] reboot when copying large amounts of data
Definitely time to bust out some mdb -K, or boot -k, and see what it's moaning about.

I did not see the screenshot earlier... sorry about that.

Nathan.

Blake wrote:
> I start the cp and then, with prstat -a, watch the CPU load for the
> cp process climb to 25% on a 4-core machine.
>
> Note that the machine does not dump properly, panic, or hang - rather,
> it reboots.
So, if I boot with the -k boot flag (to load the kernel debugger?), what do I need to look for? I'm no expert at kernel debugging. I think this is a PCI error, judging by the console output, or at least is I/O-related...

thanks for your feedback,
Blake

On Thu, Mar 12, 2009 at 2:18 AM, Nathan Kroenert <Nathan.Kroenert@sun.com> wrote:
> Definitely time to bust out some mdb -K, or boot -k, and see what it's
> moaning about.
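From the docs I've found so far, I think the flow would be something like this - please correct me, as it's an untested sketch:

# in GRUB, append -k to the kernel$ line so kmdb loads at boot
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS -k
# then, if the box drops into kmdb instead of silently resetting:
[0]> ::msgbuf        # last kernel messages
[0]> ::stack         # where it died
[0]> $<systemdump    # force a panic and crash dump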
Miles Nordin
2009-Mar-12 18:12 UTC
[zfs-discuss] reboot when copying large amounts of data
>>>>> "maj" == Maidak Alexander J <MaidakAlexanderJ at JohnDeere.com> writes:maj> If you''re having issues with a disk contoller or disk IO maj> driver its highly likely that a savecore to disk after the maj> panic will fail. I''m not sure how to work around this not in Solaris, but as a concept for solving the problem: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/kdump/kdump.txt;h=3f4bc840da8b7c068076dd057216e846e098db9f;hb=4a6908a3a050aacc9c3a2f36b276b46c0629ad91 They load a second kernel into a reserved spot of RAM, like 64MB or so, and forget about it. After a crash, they boot the second kernel. The second kernel runs using the reserved area of RAM as its working space, not touching any other memory, as if you were running on a very old machine with tiny RAM. It reprobes all the hardware, and then performs the dump. I don''t know if it actually works, but the approach is appropriate if you are trying to debug the storage stack. You could even have a main kernel which crashes while taking an ordinary coredump, and then use the backup dumping-kernel to coredump the main kernel in mid-coredump---a dump of a dumping kernel. I think some Solaris developers were discussing putting coredump features into Xen, so the host could take the dump (or, maybe even something better than a dump---for example, if you built host/target debugging features into Xen for debugging running kernels, then you could just force a breakpoint in the guest instead of panic. Since Xen can hibernate domU''s onto disk (it can, right?), you can treat the hibernated Xen-specific representation of the domU as the-dump, groveling through the ``dump'''' with the same host/target tools you could use on a running kernel without any special dump support in the debugger itself). IIRC NetBSD developers discussed the same idea years ago but neither implementation exists. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090312/0d730cae/attachment.bin>
I've managed to get the data transfer to work by rearranging my disks so that all of them sit on the integrated SATA controller.

So, I feel pretty certain that this is either an issue with the Supermicro AOC-SAT2-MV8 card, or with PCI-X on the motherboard (though I would think that the integrated SATA would also be using the PCI bus?).

The motherboard, for those interested, is an H8DME-2 (not, I now find after buying this box from Silicon Mechanics, a board that's on the Solaris HCL...)

<http://www.supermicro.com/Aplus/motherboard/Opteron2000/MCP55/h8dme-2.cfm>

So now I'm considering one of LSI's HBAs - what do list members think about this device:

<http://www.provantage.com/lsi-logic-lsi00117~7LSIG03X.htm>

On Thu, Mar 12, 2009 at 2:18 AM, Nathan Kroenert <Nathan.Kroenert at sun.com> wrote:
> definitely time to bust out some mdb -K or boot -k and see what it's moaning about.
> [...]
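Before swapping hardware, it may be worth seeing whether FMA or the drivers logged anything around the failed copies. A sketch - the disk name here is illustrative:

    # Fault-management error reports (PCI, driver, disk), verbose
    fmdump -eV | more

    # Per-device soft/hard/transport error counters
    iostat -En

    # The device path shows which controller each disk hangs off -
    # compare disks on the mv8 against the onboard nVidia ports
    zpool status -v
    ls -l /dev/dsk/c1t0d0s0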
On Thu, Mar 12, 2009 at 2:22 PM, Blake <blake.irvin at gmail.com> wrote:
> I've managed to get the data transfer to work by rearranging my disks
> so that all of them sit on the integrated SATA controller.
> [...]

I believe the MCP55's SATA controllers are actually PCI-E based.

--Tim
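You can check which bus the controllers actually sit on from a running system - a sketch; the exact node and driver names will differ per box:

    # Walk the device tree with driver bindings; the mv8 typically
    # attaches as marvell88sx and the onboard nVidia ports as
    # nv_sata, and their parent nodes show whether they hang off a
    # PCI-X bridge or a PCI-E root port
    prtconf -D | more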
Tim wrote:
> On Thu, Mar 12, 2009 at 2:22 PM, Blake <blake.irvin at gmail.com> wrote:
>> I've managed to get the data transfer to work by rearranging my disks
>> so that all of them sit on the integrated SATA controller.
>> [...]
>
> I believe the MCP55's SATA controllers are actually PCI-E based.

I use Tyan 2927 motherboards. They have on-board nVidia MCP55 chipsets, which is the same chipset as the X4500 (IIRC). I wouldn't trust the MCP55 chipset in OpenSolaris. I had random disk hangs even while the machine was mostly idle.

In Feb 2008 I bought AOC-SAT2-MV8 cards and moved all my drives to these add-in cards. I haven't had any issues with drives hanging since. There does not seem to be any problem with the SAT2-MV8 under heavy load in my servers from what I've seen.

When the SuperMicro AOC-USAS-L8i came out later last year, I started using them instead. They work better than the SAT2-MV8s.

This card needs a 3U or bigger case:
http://www.supermicro.com/products/accessories/addon/AOC-USAS-L8i.cfm

This is the low-profile card that will fit in a 2U:
http://www.supermicro.com/products/accessories/addon/AOC-USASLP-L8i.cfm

They both work in normal PCI-E slots on my Tyan 2927 mobos.

Finding good non-Sun hardware that works very well under OpenSolaris is frustrating, to say the least. Good luck.

--
Dave
Miles Nordin
2009-Mar-12 22:30 UTC
[zfs-discuss] reboot when copying large amounts of data
>>>>> "b" == Blake <blake.irvin at gmail.com> writes:b> http://www.provantage.com/lsi-logic-lsi00117~7LSIG03X.htm I''m having trouble matching up chips, cards, drivers, platforms, and modes with the LSI stuff. The more I look at it the mroe confused I get. Platforms: x86 SPARC Drivers: mpt mega_sas mfi Chips: 1068 (SAS, PCI-X) 1068E (SAS, PCIe) 1078 ??? -- from supermicro, seems to be SAS, PCIe, with support for 256 - 512MB RAM instead of the 16 - 32MB RAM on the others 1030 (parallel scsi) Cards: LSI cards http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/index.html I love the way they use the numbers 3800 and 3080, so you are constantly transposing them thus leaving google littered with all this confusingly wrong information. LSISAS3800X (PCI-X, external ports) LSISAS3080X-R (PCI-X, internal ports) LSISAS3801X (PCI-X, external ports) LSISAS3801E (PCIe, external ports) LSISAS3081E-R (PCIe, internal ports) I would have thought -R meant ``suports RAID'''' but all I can really glean through the foggy marketing-glass behind which all the information is hidden, is -R means ``all the ports are internal''''. Supermicro cards http://www.supermicro.com/products/accessories/index.cfm wow, this is even more of a mess. These are all UIO cards so I assume they have the PCIe bracket on backwards AOC-USAS-L4i (PCIe, 4 internal 4 external) AOC-USAS-L8i, AOC-USASLP-L8i (PCIe, internal ports) based on 1068E sounds similar to LSISAS3081E. Is that also 1068E? supports RAID0, RAID1, RAID10 AOC-USAS-L4iR identical to the above, but ``includes iButton'''' which is an old type of smartcard-like device with sometimes crypto and javacard support. apparently some kind of license key to unlock RAID5? no L8iR exists though, only L4iR. I have the L8i, and it does have an iButton socket with no button in it. AOC-USAS-H4iR AOC-USAS-H8iR, AOC-USASLP-H8iR (PCIe, internal ports) based on 1078 low-profile version has more memory than fullsize version?! but here is the most fun thing about the supermicro cards. All cards have one driver *EXCEPT* the L8i, which has three drivers for three modes: IT, IR, and SR. When I google for this I find notes on some of their integrated motherboards like: * The onboard LSI 1068E supported SR and IT mode but not IR mode. I also found this: * SR = Software RAID IT = Integrate. Target mode. IR mode is not supported. but no idea what the three modes are. searching for SAS SR IT IR doesn''t work either, so it''s not some SAS thing. What *is* it? also there seem to be two different kinds of quad-SATA connector on these SAS cards so there are two different kinds of octopus cable. Questions: * which chips are used by each of the LSI boards? I can guess, but in particular LSISAS3800X and LSISAS3801X seem to be different chips, while from the list of chips I''d have no choice but to guess they are both 1068. * which drivers work on x86 and which SPARC? I know some LSI cards work in SPARC but maybe not all---do the drivers support the same set of cards on both platforms? Or will normal cards not work in SPARC for lack of Forth firmware to perform some LSI-proprietary ``initialization'''' ritual? * which chips go with which drivers? Is it even that simple---will adding an iButton RAID5 license to a SuperMicro board make the same card change from mega_sas to mpt attachment, or something similar? 
For example there is a bug here about a 1068E card which doesn''t work, even though most 1068E cards do work: http://bugs.opensolaris.org/view_bug.do?bug_id=6736187 Maybe the Solaris driver needs IR mode and won''t work with the onboard supermicro chip which supports only ``software raid'''' whatever that means, which is maybe denoted by SR? What does the iButton unlock, then, features of IR mode which are abstracted from the OS driver? * What are SR, IT, and IR mode? Which modes do the Solaris drivers use, or does it matter? * Has someone found the tool mentioned here by some above-the-table means, or only by request from LSI?: http://www.opensolaris.org/jive/message.jspa?messageID=184811#184811 The mention that a SPARC version of the tool exists is encouraging. The procedure to clear persistent mappings through the BIOS obviously won''t work on SPARC. Here are the notes I have so far: -----8<-----> The driver for LSI''s MegaRAID SAS card is "mega_sas" which > was integrated into snv_88. It''s planned for backporting to > a Solaris 10 update.There is also a BSD-licensed driver for that hardware, called "mfi". It''s available from http://www.itee.uq.edu.au/~dlg/mfi> a scsi_vhci > sort of driver for the LSI card in the Ultra {20,25}Well yes, that''s mpt(7d) as delivered into NV build 63, and backported to Solaris 10 Update 5, found in patch 125081-14 and 125082-14. We''ve got support for both SAS (1064/E, 1068/E and 1078) and Parallel SCSI (1030) chips from LSI in that driver. SATA disks will always show up when attached to a SAS HBA, because that''s one of the requirements of the SAS specification.> LSISAS3801EI think you might actually be referring to the LSI SAS3801-R You''re correct in that it''s not using mpt, but mega_sas or mfi. But it''s not a SATA framework driver. -----8<----- which doesn''t seem to be internally consistent. note that SAS3801-R is not a product that exists on LSI''s site. There is LSISAS3801X (without the -R), or LSISAS3081E-R, LSISAS3080X-R (with -R). note also that 1068, 1068E, and 1078 in the middle statement suggests mpt supports every LSI card we are discussing (probably---LSI doesn''t say what chips are on each card), while the last statement contradicts it by mentioning a nonexistant card of the same era we are discussing which doesn''t work with mpt driver. -- READ CAREFULLY. By reading this fortune, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090312/c0bb2178/attachment.bin>
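One way to shortcut part of the chip/driver matrix is to ask a live system what it actually did - a sketch; pci1000 is LSI's PCI vendor ID, and the greps are illustrative:

    # Device tree with bound driver names; look for mpt, mega_sas
    # or mfi under the HBA's node
    prtconf -D | more

    # Which of the candidate drivers is loaded in the kernel
    modinfo | egrep 'mpt|mega_sas|mfi'

    # The vendor/device IDs each driver claims to bind to
    grep pci1000 /etc/driver_aliases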
Nathan Kroenert
2009-Mar-12 23:18 UTC
[zfs-discuss] reboot when copying large amounts of data
For what it's worth, I have been running Nevada (so, the same kernel as OpenSolaris) for ages (at least 18 months) on a Gigabyte board with the MCP55 chipset, and it's been flawless. I liked it so much, I bought its newer brother, based on the nVidia 750SLI chipset... M750SLI-DS4

Cheers!
Nathan.

On 13/03/09 09:21 AM, Dave wrote:
> I use Tyan 2927 motherboards. They have on-board nVidia MCP55
> chipsets, which is the same chipset as the X4500 (IIRC). I wouldn't
> trust the MCP55 chipset in OpenSolaris. I had random disk hangs even
> while the machine was mostly idle.
> [...]
Will Murnane
2009-Mar-12 23:42 UTC
[zfs-discuss] reboot when copying large amounts of data
On Thu, Mar 12, 2009 at 18:30, Miles Nordin <carton at ivy.net> wrote:
> I love the way they use the numbers 3800 and 3080, so you are
> constantly transposing them, thus leaving Google littered with all
> this confusingly wrong information.

Think of the middle two digits as (number of external ports, number of internal ports). For example, I have a 3442E-R, which has 4 internal and 4 external ports; the 3800 has 8 external ports and 0 internal, and so forth. One place this breaks down is with cards like the 8888: it has a total of 8 ports, any group of 4 of which can be mapped to internal or external ports.

> AOC-USAS-L4iR
>     identical to the above, but "includes iButton", which is an old
>     type of smartcard-like device with sometimes crypto and javacard
>     support. apparently some kind of license key to unlock RAID5?

I think the iButton is just used as an unlock code for the built-in RAID 5 functionality. Nothing the end user cares about, unless they want RAID and have to spend the extra money.

> * SR = Software RAID IT = Integrate. Target mode. IR mode is not supported.

Integrated target mode lets you export some storage attached to the host system (through another adapter, presumably) as a storage device. IR mode is almost certainly Internal RAID, which that card doesn't have support for.

> also there seem to be two different kinds of quad-SATA connector on
> these SAS cards, so there are two different kinds of octopus cable.

Yes - SFF-8484 and SFF-8087 are the key words.

> SATA disks will always show up when attached to a SAS HBA, because
> that's one of the requirements of the SAS specification.

I'm not sure what you mean by this. SAS controllers can control SATA disks and interact with them. They don't just "show up"; they're first-class citizens.

Will
Miles Nordin
2009-Mar-13 02:24 UTC
[zfs-discuss] reboot when copying large amounts of data
>>>>> "wm" == Will Murnane <will.murnane at gmail.com> writes:>> ? ? * SR = Software RAID IT = Integrate. Target mode. IR mode >> is not supported. wm> Integrated target mode lets you export some storage attached wm> to the host system (through another adapter, presumably) as a wm> storage device. IR mode is almost certainly Internal RAID, wm> which that card doesn''t have support for. no, the supermicro page for AOC-USAS-L8i does claim support for all three, and supermicro has an ``IR driver'''' available for download for Linux and Windows, or at least a link to one. I''m trying to figure out what''s involved in determining and switching modes, why you''d want to switch them, what cards support which modes, which solaris drivers support which modes, u.s.w. The answer may be very simple, like ``the driver supports only IR. Most cards support IR, and cards that don''t support IR won''t work. IR can run in single-LUN mode. Some IR cards support RAID5, others support only RAID 0, 1, 10.'''' Or it could be ``the driver supports only SR. The driver is what determines the mode, and it does this by loading firmware into the card, and the first step in initializing the card is always for the driver to load in a firmware blob. All currently-produced cards support SR.'''' so...actually, now that I say it, I guess the answer cannot be very simple. It''s going to have to be a little complicated. Anyway, I can guess, too. I was hoping someone would know for sure off-hand. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090312/9264a7b7/attachment.bin>
James C. McPherson
2009-Mar-13 02:42 UTC
[zfs-discuss] reboot when copying large amounts of data
On Thu, 12 Mar 2009 22:24:12 -0400, Miles Nordin <carton at Ivy.NET> wrote:
> I'm trying to figure out what's involved in determining and switching
> modes, why you'd want to switch them, what cards support which modes,
> which Solaris drivers support which modes, u.s.w.
> [...]
> Anyway, I can guess, too. I was hoping someone would know for sure
> off-hand.

Hi Miles,
the mpt(7D) driver supports that card. mpt(7D) supports both IT and IR firmware variants. You can find out the specifics of what RAID volume levels are supported by reading the raidctl(1M) manpage. I don't think you can switch between IT and IR firmware, but not having needed to know this before, I haven't tried it.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
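A quick way to poke at what the IR firmware exposes once mpt has attached is raidctl(1M) itself - a sketch, with placeholder disk names:

    # List RAID volumes and member disks the controller knows about
    raidctl -l
    raidctl -l c1t0d0

    # Create a hardware RAID1 volume from two disks on the HBA
    # (destructive - shown for illustration only)
    raidctl -c c1t0d0 c1t1d0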
This is really great information, though most of the controllers mentioned aren't on the OpenSolaris HCL. Seems like that should be corrected :)

My thanks to the community for their support.

On Mar 12, 2009, at 10:42 PM, "James C. McPherson" <James.McPherson at Sun.COM> wrote:
> Hi Miles,
> the mpt(7D) driver supports that card. mpt(7D) supports both IT and
> IR firmware variants. You can find out the specifics of what RAID
> volume levels are supported by reading the raidctl(1M) manpage.
> [...]
Hi folks,

I was trying to load a large file in /tmp so that a process that parses it wouldn't be limited by a disk-throughput bottleneck. My rig here has only 12GB of RAM, and the file I copied is about 12GB as well. Before the copy finished, my system restarted. I'm pretty up to date; the system is running b129.

Googling about this, I stumbled upon this thread dating back to March:

http://mail.opensolaris.org/pipermail/zfs-discuss/2009-March/027264.html

Since this seems to be a known issue, and pretty serious in my book, is there a fix pending, or is it not being investigated?

thanks in advance and happy holidays to all
-=arnaud=-
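For what it's worth, /tmp on Solaris is tmpfs, backed by RAM plus swap, so a 12GB file there competes directly with the kernel and ZFS for memory. One mitigation - a sketch, and the 4096m cap is an arbitrary figure - is to bound tmpfs in /etc/vfstab:

    # /etc/vfstab entry capping /tmp at 4GB (takes effect at boot):
    #
    #   swap  -  /tmp  tmpfs  -  yes  size=4096m

    # Verify the cap and remaining swap afterwards
    df -h /tmp
    swap -s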
arnaud wrote:
> Hi folks,
> I was trying to load a large file in /tmp so that a process that
> parses it wouldn't be limited by a disk-throughput bottleneck. My rig
> here has only 12GB of RAM, and the file I copied is about 12GB as
> well. Before the copy finished, my system restarted.
> [...]

12GB of ECC RAM? This kind of spontaneous reboot with crash dumps enabled (do you have them enabled?) tends to be hardware-related. The most common bit of bad hardware is RAM...

--
Ian.
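If hardware is the suspect, FMA's telemetry is the first place to look before swapping parts - these commands are standard on b129:

    # Anything FMA has already diagnosed as faulty (CPU, memory, I/O)
    fmadm faulty

    # Raw error reports, including correctable ECC events if any
    fmdump -eV | more

    # Confirm crash dumps are enabled so the next reboot leaves
    # something to analyze
    dumpadm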