In August last year I posted this bug, a brief summary of which is that ZFS still accepts writes to a faulted pool, causing data loss, and potentially silent data loss:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6735932

There have been no updates to the bug since September, and nobody seems to be assigned to it. Can somebody let me know what's happening with this, please?

And yes, this is definitely accepting writes that it shouldn't. For over an hour I was able to carry out writes to a pool where *all* of its drives had been *physically removed* from the server.
Ok, I noticed somebody has flagged the bug as 'retest'. I don't know whether that's aimed at Sun or at me, but either way I'm installing snv_106 on a test machine now and will check whether this is still an issue.
Ok, it's still happening in snv_106.

I plugged a USB drive into a freshly installed system and created a single-disk zpool on it:

# zpool create usbtest c1t0d0

I opened the (Nautilus?) file manager in GNOME and copied the /etc/X11 folder to it. I then copied the /etc/apache folder to it, and at 4:05pm disconnected the drive.

At this point there are *no* warnings on screen, or any indication that there is a problem. To check that the pool was still working, I created duplicates of the two folders on that drive. That worked without any errors, even though the drive was physically removed.

4:07pm
I ran zpool status; the pool is actually showing as unavailable, so at least that happened faster than in my last test. The folder is still open in GNOME, but any attempt to copy files to or from it just hangs the file transfer window.

4:09pm
/usbtest is still visible in GNOME, and I can still open a console and use the folder:

# cd usbtest
# ls
X11           X11 (copy)    apache        apache (copy)

I also tried:

# mv X11 X11-test

That hung, but I saw the X11 folder disappear from the graphical file manager, so the system still believes something is working with this pool.

The main GUI is actually a little messed up now. The GNOME file manager window looking at /usbtest has hung, and right-clicking the desktop to open a new terminal also hangs, leaving the right-click menu on screen. The main menu still works though, and I can still open a new terminal.

4:19pm
Commands such as ls are finally hanging on the pool.

At this point I tried to reboot, but that didn't work either. I used System Monitor to kill everything I had running and tried again, but that didn't help. I had to physically power off the system to reboot.

After the reboot, as expected, /usbtest still exists (even though the drive is disconnected). I removed that folder and connected the drive. ZFS detects the insertion and automounts the drive, but although the pool shows as online and the filesystem shows as mounted at /usbtest, the /usbtest directory doesn't exist. I had to export and import the pool to get it available, and as expected I've lost data:

# cd usbtest
# ls
X11

Even worse, ZFS is completely unaware of this:

# zpool status -v usbtest
  pool: usbtest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        usbtest     ONLINE       0     0     0
          c1t0d0    ONLINE       0     0     0

errors: No known data errors

So in summary, there are a good few problems here, many of which I've already reported as bugs:

1. ZFS still accepts read and write operations for a faulted pool, causing data loss that isn't necessarily reported by zpool status.
2. Even after writes start to hang, it's still possible to continue reading data from a faulted pool.
3. A faulted pool causes unwanted side effects in the GUI, making the system hard to use and impossible to reboot cleanly.
4. After a hard reset, ZFS does not recover cleanly. Unused mountpoints are left behind.
5. Automatic mounting of pools doesn't seem to work reliably.
6. zpool status doesn't report any problems mounting the pool.
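For anyone who wants to retest this on a later build, the whole test boils down to a handful of commands. This is only a rough sketch of what I did above; c1t0d0 and the folders copied are just what I happened to use, so adjust for your own setup.

Create a throwaway pool on the USB disk and put some data on it:

# zpool create usbtest c1t0d0
# cp -r /etc/X11 /etc/apache /usbtest/

Now physically pull the USB drive, then try some more I/O:

# cp -r /usbtest/X11 "/usbtest/X11 (copy)"
# ls /usbtest
# zpool status usbtest

In my tests the cp and ls carry on "working" for minutes, and zpool status is the only thing that eventually admits the pool is unavailable.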
Just another thought off the back of this: would it be possible to modify zpool status to also:

- Generate a warning if a pool has not been exported cleanly, stating that there is possible data loss.
- Check /var/adm/messages, or FMA, and warn if there have been any messages relating to drives attached to that pool.

That would go a long way towards mitigating the 'silent' part of the data loss issue, although I'd also be tempted to have ZFS not automatically import pools that were not exported cleanly, forcing explicit action by the administrator. You could then have the above warning messages displayed by zpool import, making it absolutely clear what has happened.
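Until something like that exists, the nearest manual equivalent I can think of is a quick sketch along these lines (the device name is just the one from my test; substitute whatever drives back your pool):

Check whether any imported pool is unhealthy:

# zpool status -x

Ask FMA whether it has diagnosed any faults, and look at the raw error reports:

# fmadm faulty
# fmdump -e

Then check the system log for messages about the pool's drives:

# grep c1t0d0 /var/adm/messages | tail -20

None of that tells you about writes that were silently dropped, but it at least surfaces the drive errors that zpool status currently doesn't mention.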
Ross wrote:
> Ok, it's still happening in snv_106.
>
> I plugged a USB drive into a freshly installed system and created a
> single-disk zpool on it:
> # zpool create usbtest c1t0d0
> [...]

Ross, this is a pretty good description of what I would expect when failmode=continue. What happens when failmode=panic?
 -- richard
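For reference, the property is set per pool, so retesting with the other settings is quick. A small sketch, with the pool name taken from Ross's test:

Check what the pool is currently using:

# zpool get failmode usbtest

Then repeat the pull-the-drive test with each of the other modes:

# zpool set failmode=continue usbtest
# zpool set failmode=panic usbtest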
I can check on Monday, but the system will probably panic... which doesn't really help :-)

Am I right in thinking failmode=wait is still the default? If so, that's how it will be set, as this testing was done on a clean install of snv_106. From what I've seen, I don't think this is a problem with the ZFS failmode. It's more an issue of what happens in the period *before* ZFS realises there's a problem and applies the failmode.

This time there was just a window of a couple of minutes while commands would continue. In the past I've managed to stretch it out to hours.

To me the biggest problems are:
- ZFS accepting writes that don't happen (from both before and after the drive is removed)
- No logging or warning of this in zpool status

I appreciate that if you're using a write cache, some data loss is pretty much inevitable when a pool fails, but that should be a few seconds' worth of data at worst, not minutes or hours' worth.

Also, if a pool fails completely and there's data in the cache that hasn't been committed to disk, it would be great if Solaris could respond by:

- immediately dumping the cache to any (all?) working storage
- prompting the user to fix the pool, or save the cache, before powering down the system

Ross

On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling <richard.elling at gmail.com> wrote:
> Ross, this is a pretty good description of what I would expect when
> failmode=continue. What happens when failmode=panic?
> -- richard
> [...]
On Fri, Feb 6, 2009 at 10:50 AM, Ross Smith <myxiplx at googlemail.com> wrote:
> I can check on Monday, but the system will probably panic... which
> doesn't really help :-)
>
> Am I right in thinking failmode=wait is still the default? If so,
> that's how it will be set, as this testing was done on a clean
> install of snv_106.
> [...]

Could this be related to the ZFS TXG/transaction group buffers? I.e. it buffers writes for a while before committing them to disk; then, when it's time to commit to disk, it realises the disk has failed, and only at that point enters the failmode behaviour (wait, continue, panic)? Could this be the case?

http://blogs.sun.com/roch/date/20080514

--
Brent Jones
brent at servuhome.net
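If so, a quick way to watch it from the outside would be something like the sketch below. This assumes the fbt probe on spa_sync is available on your build and that its second argument is still the transaction group number; both are assumptions on my part.

# dtrace -n 'fbt::spa_sync:entry { trace(arg1); }'

If writes keep appearing to succeed after the drive is pulled but no new txg numbers show up, the data is only sitting in memory waiting for a sync that can never complete.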
On Fri, Feb 6, 2009 at 7:04 PM, Brent Jones <brent at servuhome.net> wrote:
> Could this be related to the ZFS TXG/transaction group buffers?
> [...]

Something to do with the cache was my first thought. It seems to be able to read and write from the cache quite happily for some time, regardless of whether the pool is live.

If you're reading or writing large amounts of data, ZFS starts experiencing I/O faults and offlines the pool pretty quickly. If you're just working with small datasets, or viewing files that you've recently opened, it seems you can stretch it out for quite a while.

But yes, it seems that it doesn't enter failmode until the cache is full. I would expect it to hit this within 5 seconds, since I believe that is how often the cache should be written out.
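The flush interval should be visible as a kernel tunable, so it ought to be possible to check what this build is actually using. A rough sketch; I'm assuming the variables are still called zfs_txg_timeout and zfs_txg_synctime in snv_106, and that both are in seconds:

# echo 'zfs_txg_timeout/D' | mdb -k
# echo 'zfs_txg_synctime/D' | mdb -k

Whatever those print is roughly the worst case for how much "accepted" data can still be sitting in memory when the pool dies.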
On 6 Feb 2009, at 20:54, Ross Smith wrote:

> But yes, it seems that it doesn't enter failmode until the cache is
> full. I would expect it to hit this within 5 seconds, since I believe
> that is how often the cache should be written out.
> [...]

Note that on a lightly loaded system, it's more like 30 seconds these days.

-r