Edmund White
2011-Jun-11 13:35 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
Posted in greater detail at Server Fault - http://serverfault.com/q/277966/13325

I have an HP ProLiant DL380 G7 system running NexentaStor. The server has
36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system
drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache
and a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple
VMware hosts. I also have about 90-100GB of deduplicated data on the array.

I've had two incidents where performance tanked suddenly, leaving the VM
guests and Nexenta SSH/Web consoles inaccessible and requiring a full
reboot of the array to restore functionality. In both cases, it was the
Intel X25-M L2ARC SSD that failed or was "offlined". NexentaStor failed to
alert me on the cache failure; however, the general ZFS FMA alert was
visible on the (unresponsive) console screen.

The "zpool status" output showed:

    cache
      c6t5001517959467B45d0  FAULTED  2  542  0  too many errors

This did not trigger any alerts from within Nexenta.

I was under the impression that an L2ARC failure would not impact the
system, but in this case it was the culprit. I've never seen any
recommendations to RAID L2ARC for resiliency. Removing the bad SSD
entirely from the server got me back running, but I'm concerned about the
impact of the device failure and the lack of notification from NexentaStor.

What's the current best-choice SSD for L2ARC cache applications these
days? It seems as though the Intel units are no longer well-regarded.

--
Edmund White
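A hedged aside for anyone hitting the same failure: because cache (L2ARC)
devices hold no unique pool data, ZFS can normally drop them from the pool
online rather than requiring the SSD to be pulled physically. A minimal
sketch, assuming a pool named "tank" (the pool name is a placeholder, not
taken from the original post) and the device name from the zpool status
output above:

    # Show pool health, including the cache (L2ARC) vdevs
    zpool status -v tank

    # Drop the faulted L2ARC device from the pool; cache vdevs can be
    # removed online because they hold no unique data
    zpool remove tank c6t5001517959467B45d0

    # Later, add the repaired or replacement SSD back as a cache vdev
    zpool add tank cache c6t5001517959467B45d0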
Pasi Kärkkäinen
2011-Jun-11 15:15 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
> Posted in greater detail at Server Fault
> - http://serverfault.com/q/277966/13325
>
> I have an HP ProLiant DL380 G7 system running NexentaStor. The server has
> 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system
> drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache
> and a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple
> VMware hosts. I also have about 90-100GB of deduplicated data on the
> array.
>
> I've had two incidents where performance tanked suddenly, leaving the VM
> guests and Nexenta SSH/Web consoles inaccessible and requiring a full
> reboot of the array to restore functionality. In both cases, it was the
> Intel X25-M L2ARC SSD that failed or was "offlined". NexentaStor failed to
> alert me on the cache failure; however, the general ZFS FMA alert was
> visible on the (unresponsive) console screen.
>
> The "zpool status" output showed:
>
>     cache
>       c6t5001517959467B45d0  FAULTED  2  542  0  too many errors
>
> This did not trigger any alerts from within Nexenta.
>
> I was under the impression that an L2ARC failure would not impact the
> system, but in this case it was the culprit. I've never seen any
> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
> entirely from the server got me back running, but I'm concerned about the
> impact of the device failure and the lack of notification from
> NexentaStor.
>
> What's the current best-choice SSD for L2ARC cache applications these
> days? It seems as though the Intel units are no longer well-regarded.
>

IIRC there was a recent discussion on this list about a firmware bug on the
Intel X25 SSDs causing them to fail under high disk IO with "reset storms".

Maybe you're hitting that firmware bug.

-- Pasi
Edmund White
2011-Jun-11 15:28 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
So, can this be fixed in firmware? How can I determine if the drive is
actually bad?

--
Edmund White
ewwhite at mac.com

On 6/11/11 10:15 AM, "Pasi Kärkkäinen" <pasik at iki.fi> wrote:

> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>> Posted in greater detail at Server Fault
>> - http://serverfault.com/q/277966/13325
>>
>> I have an HP ProLiant DL380 G7 system running NexentaStor. The server
>> has 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS
>> system drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M
>> L2ARC cache and a DDRdrive PCI ZIL accelerator. This system serves NFS
>> to multiple VMware hosts. I also have about 90-100GB of deduplicated
>> data on the array.
>>
>> I've had two incidents where performance tanked suddenly, leaving the VM
>> guests and Nexenta SSH/Web consoles inaccessible and requiring a full
>> reboot of the array to restore functionality. In both cases, it was the
>> Intel X25-M L2ARC SSD that failed or was "offlined". NexentaStor failed
>> to alert me on the cache failure; however, the general ZFS FMA alert was
>> visible on the (unresponsive) console screen.
>>
>> The "zpool status" output showed:
>>
>>     cache
>>       c6t5001517959467B45d0  FAULTED  2  542  0  too many errors
>>
>> This did not trigger any alerts from within Nexenta.
>>
>> I was under the impression that an L2ARC failure would not impact the
>> system, but in this case it was the culprit. I've never seen any
>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>> entirely from the server got me back running, but I'm concerned about
>> the impact of the device failure and the lack of notification from
>> NexentaStor.
>>
>> What's the current best-choice SSD for L2ARC cache applications these
>> days? It seems as though the Intel units are no longer well-regarded.
>>
>
> IIRC there was a recent discussion on this list about a firmware bug on
> the Intel X25 SSDs causing them to fail under high disk IO with "reset
> storms".
>
> Maybe you're hitting that firmware bug.
>
> -- Pasi
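One hedged way to start answering the "is the drive actually bad?" question
from the OS side, before trusting any firmware-bug theory: check the
kernel's per-device error counters and the drive's own SMART data. These
are generic Solaris/illumos tools, the device path is illustrative, and
smartctl is only present if smartmontools happens to be installed:

    # Per-device soft/hard/transport error counters, plus vendor, model,
    # firmware revision and serial number as reported by the drive
    iostat -En

    # If smartmontools is available, pull SMART health and firmware version
    # (path is illustrative; some controllers need an extra -d option)
    smartctl -a /dev/rdsk/c6t5001517959467B45d0s0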
Jim Klimov
2011-Jun-11 16:26 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
2011-06-11 19:15, Pasi Kärkkäinen wrote:
> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>> I've had two incidents where performance tanked suddenly, leaving the VM
>> guests and Nexenta SSH/Web consoles inaccessible and requiring a full
>> reboot of the array to restore functionality. In both cases, it was the
>> Intel X25-M L2ARC SSD that failed or was "offlined". NexentaStor failed
>> to alert me on the cache failure; however, the general ZFS FMA alert was
>> visible on the (unresponsive) console screen.
>>
>> The "zpool status" output showed:
>>
>>     cache
>>       c6t5001517959467B45d0  FAULTED  2  542  0  too many errors
>>
>> This did not trigger any alerts from within Nexenta.
>>
>> I was under the impression that an L2ARC failure would not impact the
>> system, but in this case it was the culprit. I've never seen any
>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>> entirely from the server got me back running, but I'm concerned about
>> the impact of the device failure and the lack of notification from
>> NexentaStor.
> IIRC there was a recent discussion on this list about a firmware bug
> on the Intel X25 SSDs causing them to fail under high disk IO with
> "reset storms".

Even if so, this does not forgive ZFS hanging - especially if it detected
the drive failure, and especially if this drive is not required for
redundant operation.

I've seen similar bad behaviour on my oi_148a box when I tested USB flash
devices as L2ARC caches and occasionally they died by slightly moving out
of the USB socket due to vibration or whatever reason ;)

Similarly, this oi_148a box hung upon loss of the SATA connection to a
drive in the raidz2 disk set due to unreliable cable connectors, while it
should have stalled IOs to that pool but otherwise remained responsive
(I tested failmode=continue and failmode=wait on different occasions).

So I can relate - these things happen, they do annoy, and I hope they will
be fixed sometime soon so that ZFS matches its docs and promises ;)

//Jim Klimov
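For context on the failmode behaviour Jim mentions: failmode is a per-pool
ZFS property that controls how the pool reacts to catastrophic I/O failure.
A minimal sketch, with "tank" again used as a placeholder pool name rather
than anything from Jim's systems:

    # Inspect the current failure-mode policy of a pool
    zpool get failmode tank

    # wait (default): block I/O until the device recovers or is cleared
    # continue: return EIO to new writes while trying to keep the rest of
    #           the system responsive
    # panic:    crash dump and reboot on catastrophic pool failure
    zpool set failmode=continue tank

Note that failmode governs loss of pool-critical devices; as this thread
shows, a misbehaving cache device can still drag the system down before
that policy ever comes into play.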
Pasi Kärkkäinen
2011-Jun-12 11:05 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Sat, Jun 11, 2011 at 08:26:34PM +0400, Jim Klimov wrote:
> 2011-06-11 19:15, Pasi Kärkkäinen wrote:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>> I've had two incidents where performance tanked suddenly, leaving the
>>> VM guests and Nexenta SSH/Web consoles inaccessible and requiring a
>>> full reboot of the array to restore functionality. In both cases, it
>>> was the Intel X25-M L2ARC SSD that failed or was "offlined".
>>> NexentaStor failed to alert me on the cache failure; however, the
>>> general ZFS FMA alert was visible on the (unresponsive) console screen.
>>>
>>> The "zpool status" output showed:
>>>
>>>     cache
>>>       c6t5001517959467B45d0  FAULTED  2  542  0  too many errors
>>>
>>> This did not trigger any alerts from within Nexenta.
>>>
>>> I was under the impression that an L2ARC failure would not impact the
>>> system, but in this case it was the culprit. I've never seen any
>>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>> entirely from the server got me back running, but I'm concerned about
>>> the impact of the device failure and the lack of notification from
>>> NexentaStor.
>> IIRC there was a recent discussion on this list about a firmware bug
>> on the Intel X25 SSDs causing them to fail under high disk IO with
>> "reset storms".
> Even if so, this does not forgive ZFS hanging - especially
> if it detected the drive failure, and especially if this drive
> is not required for redundant operation.
>
> I've seen similar bad behaviour on my oi_148a box when
> I tested USB flash devices as L2ARC caches and
> occasionally they died by slightly moving out of the
> USB socket due to vibration or whatever reason ;)
>
> Similarly, this oi_148a box hung upon loss of the SATA
> connection to a drive in the raidz2 disk set due to
> unreliable cable connectors, while it should have
> stalled IOs to that pool but otherwise remained
> responsive (I tested failmode=continue and failmode=wait
> on different occasions).
>
> So I can relate - these things happen, they do annoy,
> and I hope they will be fixed sometime soon so that
> ZFS matches its docs and promises ;)
>

True, it definitely sounds like a bug in ZFS as well.

-- Pasi
Richard Elling
2011-Jun-12 19:52 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 11, 2011, at 6:35 AM, Edmund White wrote:

> Posted in greater detail at Server Fault -
> http://serverfault.com/q/277966/13325

Replied in greater detail at same.

> I have an HP ProLiant DL380 G7 system running NexentaStor. The server has
> 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system
> drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache
> and a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple
> VMware hosts. I also have about 90-100GB of deduplicated data on the
> array.
>
> I've had two incidents where performance tanked suddenly, leaving the VM
> guests and Nexenta SSH/Web consoles inaccessible and requiring a full
> reboot of the array to restore functionality.

The reboot is your decision; the software will, eventually, recover.

> In both cases, it was the Intel X25-M L2ARC SSD that failed or was
> "offlined". NexentaStor failed to alert me on the cache failure; however,
> the general ZFS FMA alert was visible on the (unresponsive) console
> screen.

NexentaStor fault triggers run in addition to the existing FMA and syslog
services.

> The "zpool status" output showed:
>
>     cache
>       c6t5001517959467B45d0  FAULTED  2  542  0  too many errors
>
> This did not trigger any alerts from within Nexenta.

The NexentaStor volume-check runner looks for zpool status error messages.
Check your configuration for the runner schedule; by default it is hourly.

> I was under the impression that an L2ARC failure would not impact the
> system.

With all due respect, that is a naive assumption. Any system failure can
impact the system. The worst kinds of failures are those that impact
performance. In this case, the broken SSD firmware causes very slow
response to I/O requests. It does not return an error code that says "I'm
broken"; it just responds very slowly, perhaps after other parts of the
system ask it to reset and retry a few times.

> But in this case, it was the culprit. I've never seen any recommendations
> to RAID L2ARC for resiliency. Removing the bad SSD entirely from the
> server got me back running, but I'm concerned about the impact of the
> device failure and the lack of notification from NexentaStor.

We have made some improvements in notification for this type of failure in
the 3.1 release. Why? Because we have seen a large number of these errors
from various disk and SSD manufacturers recently. You will notice that
Nexenta does not support these SSDs behind SAS expanders for this very
reason.

At the end of the day, the resolution is to get the device fixed or
replaced. Contact your hardware provider for details.

> What's the current best-choice SSD for L2ARC cache applications these
> days? It seems as though the Intel units are no longer well-regarded.

No device is perfect. Some have better firmware, components, or design than
others. YMMV.
 -- richard
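A brief aside on the FMA and syslog services Richard mentions: on an
illumos-based system such as NexentaStor, the fault that appeared on the
console can usually also be inspected from a shell after the fact. A
minimal sketch using stock fault-management commands (output and paths are
generic, not taken from this system):

    # List devices FMA currently considers faulty, with fault class,
    # affected FRU and suggested action
    fmadm faulty

    # Dump the underlying error telemetry (ereports) that led to the
    # "too many errors" diagnosis; -V adds device paths and details
    fmdump -e
    fmdump -eV

    # Tail syslog for the corresponding ZFS/FMA messages
    tail -50 /var/adm/messages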
Richard Elling
2011-Jun-12 19:57 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 11, 2011, at 9:26 AM, Jim Klimov wrote:

> 2011-06-11 19:15, Pasi Kärkkäinen wrote:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>> I've had two incidents where performance tanked suddenly, leaving the
>>> VM guests and Nexenta SSH/Web consoles inaccessible and requiring a
>>> full reboot of the array to restore functionality. In both cases, it
>>> was the Intel X25-M L2ARC SSD that failed or was "offlined".
>>> NexentaStor failed to alert me on the cache failure; however, the
>>> general ZFS FMA alert was visible on the (unresponsive) console screen.
>>>
>>> The "zpool status" output showed:
>>>
>>>     cache
>>>       c6t5001517959467B45d0  FAULTED  2  542  0  too many errors
>>>
>>> This did not trigger any alerts from within Nexenta.
>>>
>>> I was under the impression that an L2ARC failure would not impact the
>>> system, but in this case it was the culprit. I've never seen any
>>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>> entirely from the server got me back running, but I'm concerned about
>>> the impact of the device failure and the lack of notification from
>>> NexentaStor.
>> IIRC there was a recent discussion on this list about a firmware bug
>> on the Intel X25 SSDs causing them to fail under high disk IO with
>> "reset storms".
> Even if so, this does not forgive ZFS hanging - especially
> if it detected the drive failure, and especially if this drive
> is not required for redundant operation.

How long should it wait? Before you answer, read through the thread:
http://lists.illumos.org/pipermail/developer/2011-April/001996.html
Then add your comments :-)
 -- richard
Jim Klimov
2011-Jun-12 23:18 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
2011-06-12 23:57, Richard Elling wrote:
>
> How long should it wait? Before you answer, read through the thread:
> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
> Then add your comments :-)
> -- richard

Interesting thread. I did not quite get the resentment against a tunable
value instead of a hard-coded #define, though.

Especially since we might want to somehow tune it per-device: a CD-ROM,
enterprise SAS drives, commodity drives or a USB stick (or a VMware
emulated HDD, as Ceri pointed out) might all be plugged into the same box
and require different timeouts that only the sysadmin might know about
(the numeric values per-device). So I'd rather go with some hardcoded
default and many tuned lines in sd.conf, probably.

But the point of my previous comment was that, according to the original
poster, after a while his disk did get marked as "faulted" or "offlined".
If this happened during the system's initial uptime, but it froze anyway,
that is a problem.

What I do not know is whether he rebooted the box within the 5 minutes set
aside for the timeout, or whether some other processes gave up during the
5 minutes of no IO and effectively hung the system.

If it is somehow the latter - that the inaccessible drive did (lead to)
hang(ing) the system past any set IO retry timeouts - that is a bug, I
think.

But maybe I'm just too annoyed with my box hanging with a more-or-less
reproducible scenario, and now I'm barking up any tree that looks like a
system freeze related to IO ;)

//Jim
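For what it's worth, per-device tuning along the lines Jim describes does
exist in the illumos sd(7D) driver via the sd-config-list property, though
whether any of it would help against a reset storm is untested here. A
hypothetical /kernel/drv/sd.conf fragment; the vendor/product string and
the values are illustrative assumptions, not a recommendation from this
thread:

    # Match on the 8-character-padded vendor ID plus product ID reported
    # by the drive, then apply per-device tunables (see sd(7D))
    sd-config-list =
        "ATA     INTEL SSDSA2M160", "retries-timeout:1,retries-busy:1";
    # Fewer retries means the framework gives up on a wedged device sooner,
    # at the cost of declaring marginal devices failed more aggressively.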
Richard Elling
2011-Jun-12 23:34 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 12, 2011, at 4:18 PM, Jim Klimov wrote:

> 2011-06-12 23:57, Richard Elling wrote:
>>
>> How long should it wait? Before you answer, read through the thread:
>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>> -- richard
>
> Interesting thread. I did not quite get the resentment against
> a tunable value instead of a hard-coded #define, though.

Tunables are evil. They increase complexity and lead to local optimizations
that interfere with systemic optimizations.

> Especially since we might want to somehow tune it per-device: a CD-ROM,
> enterprise SAS drives, commodity drives or a USB stick (or a VMware
> emulated HDD, as Ceri pointed out) might all be plugged into the same
> box and require different timeouts that only the sysadmin might know
> about (the numeric values per-device). So I'd rather go with some
> hardcoded default and many tuned lines in sd.conf, probably.

Yuck. I'd rather have my eye poked out with a sharp stick.

> But the point of my previous comment was that, according to the original
> poster, after a while his disk did get marked as "faulted" or "offlined".
> If this happened during the system's initial uptime, but it froze anyway,
> that is a problem.
>
> What I do not know is whether he rebooted the box within the 5 minutes
> set aside for the timeout, or whether some other processes gave up during
> the 5 minutes of no IO and effectively hung the system.

Not likely. Much more likely that that which you were expecting was
blocked.

> If it is somehow the latter - that the inaccessible drive did (lead to)
> hang(ing) the system past any set IO retry timeouts - that is a bug, I
> think.
>
> But maybe I'm just too annoyed with my box hanging with a more-or-less
> reproducible scenario, and now I'm barking up any tree that looks like a
> system freeze related to IO ;)

Yep, a common reaction. I think we can be more creative...
 -- richard
Edmund White
2011-Jun-13 00:04 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On 6/12/11 6:18 PM, "Jim Klimov" <jimklimov at cos.ru> wrote:

> 2011-06-12 23:57, Richard Elling wrote:
>>
>> How long should it wait? Before you answer, read through the thread:
>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>> -- richard
>
> But the point of my previous comment was that, according to the original
> poster, after a while his disk did get marked as "faulted" or "offlined".
> If this happened during the system's initial uptime, but it froze anyway,
> that is a problem.
>
> What I do not know is whether he rebooted the box within the 5 minutes
> set aside for the timeout, or whether some other processes gave up during
> the 5 minutes of no IO and effectively hung the system.
>
> If it is somehow the latter - that the inaccessible drive did (lead to)
> hang(ing) the system past any set IO retry timeouts - that is a bug, I
> think.
>

Here's the timeline:

- The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
  detected by NexentaStor.
- The storage system performance diminished at 9am the next morning, with
  intermittent spikes in system load (of the VMs hosted on the unit).
- By 11am, the Nexenta interface and console were unresponsive and the
  virtual machines dependent on the underlying storage stalled completely.
- At 12pm, I gained physical access to the server, but I could not acquire
  console access (shell or otherwise). I did see the FMA error output on
  the screen indicating the actual device FAULT time.
- I powered the system off, removed the Intel X25-M, and powered back on.
  The VMs picked up where they left off and the system stabilized.

The total impact to end-users was 3 hours of either poor performance or
straight downtime.

--
Edmund White
ewwhite at mac.com
Richard Elling
2011-Jun-13 00:25 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Jun 12, 2011, at 5:04 PM, Edmund White wrote:

> On 6/12/11 6:18 PM, "Jim Klimov" <jimklimov at cos.ru> wrote:
>> 2011-06-12 23:57, Richard Elling wrote:
>>>
>>> How long should it wait? Before you answer, read through the thread:
>>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>>> Then add your comments :-)
>>> -- richard
>>
>> But the point of my previous comment was that, according to the original
>> poster, after a while his disk did get marked as "faulted" or
>> "offlined". If this happened during the system's initial uptime, but it
>> froze anyway, that is a problem.
>>
>> What I do not know is whether he rebooted the box within the 5 minutes
>> set aside for the timeout, or whether some other processes gave up
>> during the 5 minutes of no IO and effectively hung the system.
>>
>> If it is somehow the latter - that the inaccessible drive did (lead to)
>> hang(ing) the system past any set IO retry timeouts - that is a bug, I
>> think.
>>
>
> Here's the timeline:
>
> - The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
>   detected by NexentaStor.

Is the volume-check runner enabled? All of the check runner results are
logged in the report database and sent to the system administrator via
email. I will assume that you have configured email for delivery, as it is
a required step in the installation procedure.

In any case, a disk declared FAULTED is no longer used by ZFS, except when
a pool is cleared. The volume-check runner can do this on your behalf, if
it is configured to do so. See Data Management -> Runners -> volume-check.
And, of course, these actions are recorded in the logs and report database.

> - The storage system performance diminished at 9am the next morning, with
>   intermittent spikes in system load (of the VMs hosted on the unit).

This is consistent with reset storms.

> - By 11am, the Nexenta interface and console were unresponsive and the
>   virtual machines dependent on the underlying storage stalled
>   completely.

Also consistent with reset storms.

> - At 12pm, I gained physical access to the server, but I could not
>   acquire console access (shell or otherwise). I did see the FMA error
>   output on the screen indicating the actual device FAULT time.
> - I powered the system off, removed the Intel X25-M, and powered back on.
>   The VMs picked up where they left off and the system stabilized.
>
> The total impact to end-users was 3 hours of either poor performance or
> straight downtime.

Yes, this is consistent with reset storms. Older Intel SSDs are not the
only devices that handle this poorly. In my experience a number of SATA
devices are poorly designed :-(
 -- richard
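For completeness, the manual equivalent of the "pool is cleared" step
Richard refers to is a zpool clear, which resets the error counters for a
FAULTED device and lets ZFS try to use it again. A minimal sketch, again
with "tank" as a placeholder pool name, and only appropriate once the
underlying device problem has actually been resolved:

    # Reset the error counters for the faulted cache device and let ZFS
    # attempt to reopen it
    zpool clear tank c6t5001517959467B45d0

    # Confirm the result
    zpool status -v tank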
Edmund White
2011-Jun-13 00:52 UTC
[zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On 6/12/11 7:25 PM, "Richard Elling" <richard.elling at gmail.com> wrote:

>> Here's the timeline:
>>
>> - The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
>>   detected by NexentaStor.
>
> Is the volume-check runner enabled? All of the check runner results are
> logged in the report database and sent to the system administrator via
> email. I will assume that you have configured email for delivery, as it
> is a required step in the installation procedure.
>
> In any case, a disk declared FAULTED is no longer used by ZFS, except
> when a pool is cleared. The volume-check runner can do this on your
> behalf, if it is configured to do so. See Data Management -> Runners ->
> volume-check. And, of course, these actions are recorded in the logs and
> report database.
>
> -- richard

I checked seven of my NexentaStor installations (3.0.4 and 3.0.5). Six of
them had the disk-check fault trigger disabled by default. volume-check is
enabled on all and is set to run hourly.

Email notification is configured, and I actively receive other alerts (DDT
table, auto-sync) and reports.

--
Edmund White
ewwhite at mac.com
847-530-1605