Hi all,

First, I'll say my intent is not to spam a bunch of lists, but after posting to opensolaris-discuss I had someone communicate with me offline that these lists would possibly be a better place to start. So here we are. For those on all three lists, sorry for the repetition.

Second, this message is meant to solicit help in diagnosing the issue described below. Any hints on how DTrace may help, or where in general to start, would be much appreciated. Back to the subject at hand.

-----------------------

I'm testing an application which makes use of a large file mmap'd into memory, as if the application were using malloc(). The file is roughly 2x the size of physical ram. Basically, I'm seeing the system stall for long periods of time, 60+ seconds, and then resume. The file lives on an SSD (Intel X25-E), and I'm using zfs's lzjb compression to make more efficient use of the ~30G of space provided by that SSD.

The general flow of things is: start the application, ask it to use a 50G file. The file is created in a sparse manner at the location designated, then mmap is called on the entire file. All fine up to this point.

I then start loading data into the application, and it starts pushing data to the file as you'd expect. Data is pushed to the file early and often, as it's mmap'd with the MAP_SHARED flag. But when the application's resident size reaches about 80% of the physical ram on the system, the system starts paging; things still work relatively well, though slower, as expected.

Soon after, when reaching about 40G of data, I get stalls accessing the SSD (according to iostat); in other words, no IO to that drive. When I started looking into what could be causing it, such as IO timeouts, I ran dmesg and it hung after printing a timestamp. I can ctrl-c dmesg, but subsequent runs provide no better results. I see no new messages in /var/adm/messages, as I'd expect.

Eventually the system recovers; the latest case took over 10 minutes to recover, after killing the application mentioned above, and I do see disk timeouts in dmesg.

So, I can only assume that there's either a driver bug in the SATA/SAS controller I'm using and it's throwing timeouts, or the SSD is having issues. Looking at the zpool configuration, I see that failmode=wait, and since that SSD is the only member of the zpool I would expect IO to hang. But does that mean that dmesg should hang also? Does that mean that the kernel has at least one thread stuck? Would failmode=continue be more desirable, or more resilient?

During the hang, load-avg is artificially high, fmd being the one process that sticks out in prstat output. But fmdump -v doesn't show anything relevant.

Anyone have ideas on how to diagnose what's going on there?

Thanks,
Ethan

System: Sun x4240, dual AMD 2347, 32G of ram
SAS/SATA Controller: LSI 3081E
OS: osol snv_98
SSD: Intel X25-E
I'm no expert, but the first thing I'd ask is whether you could repeat that test without using compression? I'd be quite worried about how a system is going to perform when it's basically running off a 50GB compressed file.

There seem to be a lot of variables here, on quite a few new systems, and I'd want to try to focus on individual subsystems to ensure they're all working to spec first.

If you can try without compression, the next thing I'd want to test is a file that's smaller than RAM. That should hopefully eliminate paging or RAM pressure, and allow you to verify that the storage subsystem is performing as expected. You could then re-enable compression and see if it still works.
-- 
This message posted from opensolaris.org
Ethan Erchinger wrote:
> [...]
> During the hang, load-avg is artificially high, fmd being the one
> process that sticks out in prstat output. But fmdump -v doesn't show
> anything relevant.

I've seen these symptoms when a large number of errors were reported in a short period of time and memory was low. What does "fmdump -eV" show?

Also, it would help to know what OS release you are using.
 -- richard

> Anyone have ideas on how to diagnose what's going on there?
>
> Thanks,
> Ethan
>
> System: Sun x4240, dual AMD 2347, 32G of ram
> SAS/SATA Controller: LSI 3081E
> OS: osol snv_98
> SSD: Intel X25-E
Richard Elling wrote:
> I've seen these symptoms when a large number of errors were reported
> in a short period of time and memory was low. What does "fmdump -eV"
> show?

fmdump -eV shows lots of messages like this, and yes, I believe that to be sd16, which is the SSD:

Dec 03 2008 08:31:11.224690595 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.rqs.derr
        ena = 0x76bbfe8ef9000c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@f,0
                devid = id1,sd@n5001517958702bc7
        (end detector)

        driver-assessment = retry
        op-code = 0x2a
        cdb = 0x2a 0x0 0x1 0xe8 0x20 0x88 0x0 0x0 0x3 0x0
        pkt-reason = 0x0
        pkt-state = 0x37
        pkt-stats = 0x0
        stat-code = 0x2
        key = 0x6
        asc = 0x29
        ascq = 0x0
        sense-data = 0x70 0x0 0x6 0x0 0x0 0x0 0x0 0xa 0x0 0x0 0x0 0x0 0x29 0x0 0x0 0x0 0x0 0x0 0x0 0x0
        __ttl = 0x1
        __tod = 0x4936b44f 0xd6481a3

Dec 03 2008 08:31:11.224690595 ereport.io.scsi.cmd.disk.recovered
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.recovered
        ena = 0x76bbfe8ef9000c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@f,0
                devid = id1,sd@n5001517958702bc7
        (end detector)

        driver-assessment = recovered
        op-code = 0x2a
        cdb = 0x2a 0x0 0x1 0xe8 0x20 0x88 0x0 0x0 0x3 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4936b44f 0xd6481a3

> Also, it would help to know what OS release you are using.

Oh, it was at the bottom of my original message, but I just realized that I'm actually running snv_99; I forgot that I had performed an image-update a while back.
Ross wrote:
> I'm no expert, but the first thing I'd ask is whether you could repeat
> that test without using compression? I'd be quite worried about how a
> system is going to perform when it's basically running off a 50GB
> compressed file.

Yes, this does occur with compression off, but it's harder to repeat. I think that's only because physical memory happens to closely match the size of the SSD: 32G of ram, 32G SSD, both with some overhead taking the respective usable sizes to 26-28G.

> There seem to be a lot of variables here, on quite a few new systems,
> and I'd want to try to focus on individual subsystems to ensure they're
> all working to spec first.

New systems? SSD, yes. But the application is very well tested, although using malloc(), not mmap(). Compression isn't new; lots of people use it. I've run this system, running MySQL, through many many tests and it's worked quite nicely. Many variables, yes; lots of new code paths, I'd hope not.

> If you can try without compression, the next thing I'd want to test is
> a file that's smaller than RAM. That should hopefully eliminate paging
> or RAM pressure, and allow you to verify that the storage subsystem is
> performing as expected. You could then re-enable compression and see if
> it still works.

Running with a file smaller than ram does not have this issue. As a test I've run things a little differently: I put the SSD as swap and used malloc() again, but allocated a large dataset > physical memory, and I've seen very similar stalls. There is 0 IO, but the application is blocked on something. I guess I should try to insert some debug code, or use dtruss to see if the application is waiting on a syscall.
On Thu, Dec 4, 2008 at 11:55 AM, Ethan Erchinger <ethan at plaxo.com> wrote:
> Running with a file smaller than ram does not have this issue. As a
> test I've run things a little differently: I put the SSD as swap and
> used malloc() again, but allocated a large dataset > physical memory,
> and I've seen very similar stalls. There is 0 IO, but the application
> is blocked on something.

Are you leaving ANY ram for zfs to do its thing? If you're consuming ALL system memory for just this file/application, I would expect the system to fall over and die.

--Tim
Tim wrote:
> Are you leaving ANY ram for zfs to do its thing? If you're consuming
> ALL system memory for just this file/application, I would expect the
> system to fall over and die.

Hmm. I believe that the kernel should manage that relationship for me. If the system cannot manage swap or paging on large mmap() files, there is something very wrong with the virtual memory manager in the kernel.
Ethan Erchinger wrote:
> fmdump -eV shows lots of messages like this, and yes, I believe that to
> be sd16, which is the SSD:
>
> Dec 03 2008 08:31:11.224690595 ereport.io.scsi.cmd.disk.dev.rqs.derr
> [...]
>         device-path = /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@f,0
>         devid = id1,sd@n5001517958702bc7
> [...]
>         driver-assessment = retry

We retry, which is why there is a recovery e-report later.

>         op-code = 0x2a
>         cdb = 0x2a 0x0 0x1 0xe8 0x20 0x88 0x0 0x0 0x3 0x0
>         pkt-reason = 0x0
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         stat-code = 0x2
>         key = 0x6
>         asc = 0x29
>         ascq = 0x0

ASC/ASCQ 29/00 is POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
http://www.t10.org/lists/asc-num.htm#ASC_29

[this should be more descriptive, as the codes are, more or less, standardized; I'll try to file an RFE, unless someone beat me to it]

Depending on which system did the reset, it should be noted in the /var/adm/messages log. This makes me suspect the hardware (firmware, actually).
 -- richard
Richard Elling wrote:
>> asc = 0x29
>> ascq = 0x0
>
> ASC/ASCQ 29/00 is POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
> http://www.t10.org/lists/asc-num.htm#ASC_29
>
> Depending on which system did the reset, it should be noted in the
> /var/adm/messages log. This makes me suspect the hardware (firmware,
> actually).

Firmware of the SSD, or something else?
Ethan Erchinger wrote:
> Firmware of the SSD, or something else?

The answer may lie in the /var/adm/messages file, which should report if a reset was received or sent.
 -- richard
Richard Elling wrote:
> The answer may lie in the /var/adm/messages file, which should report
> if a reset was received or sent.

Here is a sample set of messages at that time. It looks like timeouts on the SSD for various requested blocks. Maybe I need to talk with Intel about this issue.

Ethan

=========================
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@f,0 (sd16):
Dec  2 20:14:01 opensolaris     Error for Command: write    Error Level: Retryable
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   Requested Block: 840    Error Block: 840
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   Vendor: ATA    Serial Number: CVEM840201EU
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   Sense Key: Unit_Attention
Dec  2 20:14:01 opensolaris scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Dec  2 20:15:08 opensolaris scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:15:08 opensolaris     Disconnected command timeout for Target 15
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:15:09 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:15:09 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:15:09 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:15:09 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:15:09 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:15:09 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:15:09 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:15:09 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:15:09 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@f,0 (sd16):
Dec  2 20:15:12 opensolaris     Error for Command: write    Error Level: Retryable
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   Requested Block: 810    Error Block: 810
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   Vendor: ATA    Serial Number: CVEM840201EU
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   Sense Key: Unit_Attention
Dec  2 20:15:12 opensolaris scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Dec  2 20:16:19 opensolaris scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:16:19 opensolaris     Disconnected command timeout for Target 15
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:16:21 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:16:21 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:16:21 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:16:21 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:16:21 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:16:21 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Dec  2 20:16:21 opensolaris scsi: [ID 365881 kern.info] /pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt0):
Dec  2 20:16:21 opensolaris     Log info 0x31140000 received for target 15.
Dec  2 20:16:21 opensolaris     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Ethan Erchinger wrote:
> Here is a sample set of messages at that time. It looks like timeouts
> on the SSD for various requested blocks. Maybe I need to talk with
> Intel about this issue.

Keeping everyone up-to-date, for those who care: I've RMA'd the Intel drive, and will retest when the replacement arrives. I'm working under the assumption that I have a bad drive.

Ethan