Is there a way to view the status of the RAM's ECC single- or double-bit errors? I would like to confirm that ECC on my Xeon E5520 and the ECC RAM are doing their job, since memtest is ambiguous.

I am running a memory test on a P6T6 WS, E5520 Xeon, 2GB Samsung ECC modules and this is what is on the screen:

Chipset: Core IMC (ECC : Detect / Correct)

However, further down "ECC" is identified as being "off". Yet there is a column for "ECC Errs".

I don't know how to interpret this. Is ECC active or not?

http://img535.imageshack.us/img535/3981/ecc.jpg
--
This message posted from opensolaris.org
Casper.Dik at Sun.COM
2010-Mar-03 15:14 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
>Is there a method to view the status of the rams ecc single or double bit errors? I would like to
>confirm that ecc on my xeon e5520 and ecc ram are performing their role since memtest is ambiguous.
>
>I am running memory test on a p6t6 ws, e5520 xeon, 2gb samsung ecc modules and this is what is on
>the screen:
>
>Chipset: Core IMC (ECC : Detect / Correct)
>
>However, further down "ECC" is identified as being "off". Yet there is a column for "ECC Errs".
>
>I don't know how to interpret this. Is ECC active or not?

Off but only disabled by memtest, I believe.

You can enable it in the memtest menu.

Casper
Tomas Ögren
2010-Mar-03 15:19 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
On 03 March, 2010 - Casper.Dik at Sun.COM sent me these 0,8K bytes:

> >Is there a method to view the status of the rams ecc single or double bit errors? I would like to
> >confirm that ecc on my xeon e5520 and ecc ram are performing their role since memtest is ambiguous.
> >
> >I am running memory test on a p6t6 ws, e5520 xeon, 2gb samsung ecc modules and this is what is on
> >the screen:
> >
> >Chipset: Core IMC (ECC : Detect / Correct)
> >
> >However, further down "ECC" is identified as being "off". Yet there is a column for "ECC Errs".
> >
> >I don't know how to interpret this. Is ECC active or not?
>
> Off but only disabled by memtest, I believe.

Memtest doesn't want potential errors to be hidden by ECC, so it disables ECC to see them if they occur.

> You can enable it in the memtest menu.
>
> Casper

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Robert Milkowski
2010-Mar-03 16:31 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
On 03/03/2010 15:19, Tomas Ögren wrote:

> Memtest doesn't want potential errors to be hidden by ECC, so it
> disables ECC to see them if they occur.

Still, it is a valid question: is there a way under the OS to check if ECC is disabled or enabled?

--
Robert Milkowski
http://milek.blogspot.com
Darren J Moffat
2010-Mar-03 16:33 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
Robert Milkowski wrote:

> On 03/03/2010 15:19, Tomas Ögren wrote:
>>
>> Memtest doesn't want potential errors to be hidden by ECC, so it
>> disables ECC to see them if they occur.
>
> still it is valid question - is there a way under OS to check if ECC is
> disabled or enabled?

Maybe something in the output of smbios(1M)

--
Darren J Moffat
Miles Nordin
2010-Mar-03 19:33 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
>>>>> "a" == ace <tojaktoty at gmail.com> writes:

     a> However, further down "ECC" is identified as being "off". Yet
     a> there is a column for "ECC Errs".

     a> I don't know how to interpret this. Is ECC active or not?

``Short circuit a data line or preferably a parity bit data line on one of the DDR memory modules with ground for a short period. For example pin 49 (parity bit 2, ie. bit 66) and pin 51 (parity bit 3 bit 67) are fine. The pin between them is Ground. Count the pins from a DIMM where number 1 pin is marked. It is easy to stuff a lead to the DIMM socket into the holes next to the socket pins. I used 10 ohm resistor to make the probability of damage smaller.''

 -- http://hyvatti.iki.fi/~jaakko/sw/

The idea would be to attach this resistor and then look for an ECC error count incrementing somewhere. If the software is good it ought to continue functioning and identify the bad DIMM. likely? i dunno.
Robert Milkowski
2010-Mar-03 22:22 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
On 03/03/2010 16:33, Darren J Moffat wrote:

> Robert Milkowski wrote:
>> On 03/03/2010 15:19, Tomas Ögren wrote:
>>>
>>> Memtest doesn't want potential errors to be hidden by ECC, so it
>>> disables ECC to see them if they occur.
>>
>> still it is valid question - is there a way under OS to check if ECC
>> is disabled or enabled?
>
> Maybe something in the output of smbios(1M)

bingo! thank you.

--
Robert Milkowski
http://milek.blogspot.com
Simon Breden
2010-Mar-03 23:32 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
I ran smbios and for the memory-related section I saw the following:

ID    SIZE TYPE
64    15   SMB_TYPE_MEMARRAY (physical memory array)

  Location: 3 (system board or motherboard)
  Use: 3 (system memory)
  ECC: 3 (none)
  Number of Slots/Sockets: 4
  Memory Error Data: Not Supported
  Max Capacity: 4294967296 bytes

ID    SIZE TYPE
65    62   SMB_TYPE_MEMDEVICE (memory device)

  Manufacturer: None
  Serial Number: None
  Asset Tag: None
  Location Tag: DIMM_B1
  Part Number: None

  Physical Memory Array: 64
  Memory Error Data: Not Supported

  Total Width: 72 bits
  Data Width: 64 bits
  Size: 1073741824 bytes
  Form Factor: 9 (DIMM)
  Set: None
  Memory Type: 18 (DDR)
  Flags: 0x0
  Speed: 1ns
  Device Locator: DIMM_B1
  Bank Locator: Bank0/1
...

From this output it appears that Solaris, via the BIOS I presume, thinks the machine doesn't have ECC RAM, even though all the memory modules are indeed ECC modules.

Might be time to (1) check my current BIOS settings, even though I felt sure ECC was enabled in the BIOS already, and (2) check for a newer BIOS update. A pity, as the machine has been rock-solid so far, and I don't like changing stable BIOSes...

Here's the start of the SMBIOS output:

# smbios
ID    SIZE TYPE
0     104  SMB_TYPE_BIOS (BIOS information)

  Vendor: Phoenix Technologies, LTD
  Version String: ASUS M2N-SLI DELUXE ACPI BIOS Revision 1502
  Release Date: 03/31/2008
  Address Segment: 0xe000
  ROM Size: 524288 bytes
  Image Size: 131072 bytes
  Characteristics: 0x7fcb9e80
        SMB_BIOSFL_PCI (PCI is supported)
        SMB_BIOSFL_PLUGNPLAY (Plug and Play is supported)
        SMB_BIOSFL_APM (APM is supported)
        SMB_BIOSFL_FLASH (BIOS is Flash Upgradeable)
        SMB_BIOSFL_SHADOW (BIOS shadowing is allowed)
        SMB_BIOSFL_CDBOOT (Boot from CD is supported)
        SMB_BIOSFL_SELBOOT (Selectable Boot supported)
        SMB_BIOSFL_ROMSOCK (BIOS ROM is socketed)
        SMB_BIOSFL_EDD (EDD Spec is supported)
        SMB_BIOSFL_525_360K (int 0x13 5.25" 360K floppy)
        SMB_BIOSFL_525_12M (int 0x13 5.25" 1.2M floppy)
        SMB_BIOSFL_35_720K (int 0x13 3.5" 720K floppy)
        SMB_BIOSFL_35_288M (int 0x13 3.5" 2.88M floppy)
        SMB_BIOSFL_I5_PRINT (int 0x5 print screen svcs)
        SMB_BIOSFL_I9_KBD (int 0x9 8042 keyboard svcs)
        SMB_BIOSFL_I14_SER (int 0x14 serial svcs)
        SMB_BIOSFL_I17_PRINTER (int 0x17 printer svcs)
        SMB_BIOSFL_I10_CGA (int 0x10 CGA svcs)
  Characteristics Extension Byte 1: 0x33
        SMB_BIOSXB1_ACPI (ACPI is supported)
        SMB_BIOSXB1_USBL (USB legacy is supported)
        SMB_BIOSXB1_LS120 (LS-120 boot is supported)
        SMB_BIOSXB1_ATZIP (ATAPI ZIP drive boot is supported)
  Characteristics Extension Byte 2: 0x5
        SMB_BIOSXB2_BBOOT (BIOS Boot Specification supported)
        SMB_BIOSXB2_ETCDIST (Enable Targeted Content Distrib.)
  Version Number: 0.0
  Embedded Ctlr Firmware Version Number: 0.0

Cheers,
Simon
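As an aside on reading the smbios(1M) output above: the "ECC:" line of the SMB_TYPE_MEMARRAY record is the field to look at. The following is only a minimal sketch, assuming the text layout shown in this thread; the function names are made up for illustration, and the numeric codes 5 (single-bit ECC) and 6 (multi-bit ECC) are taken from the SMBIOS memory-array error-correction enumeration as far as I recall it. Keep in mind it only tells you what the BIOS reports, not what the memory controller is actually doing.

```python
import re

# Sample text copied from the smbios output quoted in this thread.
SAMPLE = """\
ID    SIZE TYPE
64    15   SMB_TYPE_MEMARRAY (physical memory array)

  Location: 3 (system board or motherboard)
  Use: 3 (system memory)
  ECC: 3 (none)
  Number of Slots/Sockets: 4
  Memory Error Data: Not Supported
  Max Capacity: 4294967296 bytes
"""

# Assumption: codes 5 and 6 mean single-bit and multi-bit ECC respectively.
ECC_CODES = {5, 6}

ECC_LINE = re.compile(r"^\s*ECC:\s*(\d+)\s*\(([^)]*)\)", re.MULTILINE)


def memarray_ecc_modes(smbios_text):
    """Return [(code, description), ...] for every 'ECC:' line in the text."""
    return [(int(code), desc) for code, desc in ECC_LINE.findall(smbios_text)]


def bios_reports_ecc(smbios_text):
    """True if at least one memory array is reported with an ECC mode."""
    return any(code in ECC_CODES for code, _ in memarray_ecc_modes(smbios_text))


if __name__ == "__main__":
    print(memarray_ecc_modes(SAMPLE))
    print(bios_reports_ecc(SAMPLE))
```

In practice the text would come from something like `smbios -t SMB_TYPE_MEMARRAY` saved to a file and passed to these functions.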
Miles Nordin
2010-Mar-03 23:52 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
>>>>> "sb" == Simon Breden <sbreden at gmail.com> writes:

    sb> ASUS M2N-SLI DELUXE ACPI BIOS

If it is AMD then:

 http://ar.opensolaris.org/jive/message.jspa?messageID=345422#345422

The scripts need 'setpci' for solaris:

 http://blogs.sun.com/thebentzone/entry/compiling_pciutils_lspci_on_solaris

(untested)

Also keep in mind it is not just on/off. You need to set the speed of AMD's hardware scrubber to something reasonable, and verify that solaris will alert you when ECC errors are happening, especially uncorrectable ones, otherwise the memory is not very useful.
Simon Breden
2010-Mar-04 00:10 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
Thanks Miles, I'll take a look.

Cheers,
Simon
Thanks for the info everyone! I will now set up scrubbing and verify ECC alerts.

Miles, AMD and Intel's new Xeon with the integrated memory controller ought to behave and interact with opensolaris the same way, yes?
Yes, you are correct. Thanks.
Miles Nordin
2010-Mar-04 03:09 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
>>>>> "a" == ace <tojaktoty at gmail.com> writes:

     a> Miles, AMD and intel's new xeon with the integrated memory
     a> controller ought to behave and interact with opensolaris the
     a> same way, yes?

No, I think they'd interact differently.

The interaction is ``reporting errors'' I guess. I think each major memory controller family has in the past needed a separate driver.

The solaris memory scrubber might also be described as interaction, and it is afaict silly on AMD because there is a hardware scrubber which is additionally able to scrub the L2 cache. I don't know if intel has a hardware scrubber, so there is another potential difference.

Scrubbing may be silly period, statistically: you are betting that the exact same row will get hit twice, which with an error rate of at most once/month seems implausible. But I still like the idea because I'm interested in spotting bad ram or dirty connections more deterministically instead of having to put ``memory load'' on the machine or something.

definitely worth sorting all this out somehow instead of just paying and hoping! sorry I do not have real answers.

 http://www.beowulf.org/archive/2008-May/021335.html
 http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf
Simon, when I call:

~$ smbios -t SMB_TYPE_MEMARRAY

I receive:

ID    SIZE TYPE
47    15   SMB_TYPE_MEMARRAY (physical memory array)

  Location: 3 (system board or motherboard)
  Use: 3 (system memory)
  ECC: 6 (multi-bit ECC)
  Number of Slots/Sockets: 6
  Memory Error Data: Not Supported
  Max Capacity: 25769883776 bytes

and under

65    62   SMB_TYPE_MEMDEVICE (memory device)

the terminal shows for the modules:

  Memory Error Data: Not Supported

I recollect reading about a command that verifies the status of the ECC memory controller and displays an error count. Anyone familiar?

Scrubbing would be nice, but first I'd like to find out where to get any status or readings. I've scanned through most of the links posted so far and haven't found how to get status.

Miles states that the Intel Nehalem Xeon memory controller may need a separate driver:

     a> Miles, AMD and intel's new xeon with the integrated memory
     a> controller ought to behave and interact with opensolaris the
     a> same way, yes?

> No, I think they'd interact differently.
>
> The interaction is ``reporting errors'' I guess. I think each major memory
> controller family has in the past needed a separate driver.

My question, in other words: is there a distinct command in opensolaris to view Intel and/or AMD memory error reports?
"A process will continually scrub the memory, and is capable of correcting any one error per 64-bit word of memory." at http://www.stringliterals.com/?tag=opensolaris. If this is true what is the process and how is it accessed? -- This message posted from opensolaris.org
Henrik Johansson
2010-Mar-04 12:03 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
Hello,

On 4 mar 2010, at 10.26, ace <tojaktoty at gmail.com> wrote:

> "A process will continually scrub the memory, and is capable of
> correcting any one error per 64-bit word of memory."
> at http://www.stringliterals.com/?tag=opensolaris.
>
> If this is true what is the process and how is it accessed?

No, it's a kernel thread, something like:

# echo ::thread ! grep scrub

Or

# echo "memscrub_scans_done/U" | mdb -k

This depends on what platform you are on; some platforms do this in hardware. Google for the latter to find some good pages with more info. I'm not at my workstation, so mind minor faults.

Henrik
http://sparcv9.blogspot.com
Richard PALO
2010-Mar-09 10:51 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
I'm curious to know whether the following output:

bash-4.0# echo "memscrub_scans_done/U" | mdb -k
memscrub_scans_done:
memscrub_scans_done:            1985

means that Solaris considers ECC memory to be effectively installed (given that it is non-zero)?

I have installed unbuffered ECC memory (2x4GB crucial kit CT2KIT51272AA667).

The reason I ask is that the vendor (ACER) of the mainboard says it is not supported, and I can not get into the BIOS any more, but osol boots fine and sees 8GB. Crucial says it's not supported because Acer says it's not supported... This is an MCP78S based motherboard (apparently equivalent Asus and Gigabyte boards are _supported_ platforms for this memory)...

The following output from smbios:

0     78   SMB_TYPE_BIOS (BIOS information)

  Vendor: Phoenix Technologies, LTD
  Version String: R01-B0
  Release Date: 03/31/2009
  Address Segment: 0xe000
  ROM Size: 524288 bytes
  Image Size: 131072 bytes
  Characteristics: 0x7fcb9e90
        SMB_BIOSFL_ISA (ISA is supported)
        SMB_BIOSFL_PCI (PCI is supported)
        SMB_BIOSFL_PLUGNPLAY (Plug and Play is supported)
        SMB_BIOSFL_APM (APM is supported)
        SMB_BIOSFL_FLASH (BIOS is Flash Upgradeable)
        SMB_BIOSFL_SHADOW (BIOS shadowing is allowed)
        SMB_BIOSFL_CDBOOT (Boot from CD is supported)
        SMB_BIOSFL_SELBOOT (Selectable Boot supported)
        SMB_BIOSFL_ROMSOCK (BIOS ROM is socketed)
        SMB_BIOSFL_EDD (EDD Spec is supported)
        SMB_BIOSFL_525_360K (int 0x13 5.25" 360K floppy)
        SMB_BIOSFL_525_12M (int 0x13 5.25" 1.2M floppy)
        SMB_BIOSFL_35_720K (int 0x13 3.5" 720K floppy)
        SMB_BIOSFL_35_288M (int 0x13 3.5" 2.88M floppy)
        SMB_BIOSFL_I5_PRINT (int 0x5 print screen svcs)
        SMB_BIOSFL_I9_KBD (int 0x9 8042 keyboard svcs)
        SMB_BIOSFL_I14_SER (int 0x14 serial svcs)
        SMB_BIOSFL_I17_PRINTER (int 0x17 printer svcs)
        SMB_BIOSFL_I10_CGA (int 0x10 CGA svcs)
  Characteristics Extension Byte 1: 0x33
        SMB_BIOSXB1_ACPI (ACPI is supported)
        SMB_BIOSXB1_USBL (USB legacy is supported)
        SMB_BIOSXB1_LS120 (LS-120 boot is supported)
        SMB_BIOSXB1_ATZIP (ATAPI ZIP drive boot is supported)
  Characteristics Extension Byte 2: 0x5
        SMB_BIOSXB2_BBOOT (BIOS Boot Specification supported)
        SMB_BIOSXB2_ETCDIST (Enable Targeted Content Distrib.)
  Version Number: 0.0
  Embedded Ctlr Firmware Version Number: 0.0

ID    SIZE TYPE
1     78   SMB_TYPE_SYSTEM (system information)

  Manufacturer: Acer
  Product: Aspire X3200
  Version: R01-A3
  Serial Number: 9E3PM75C7P839053093003

  UUID: ffffffff-ffff-ffff-ffff-ffffffffffff
  Wake-Up Event: 0x6 (power switch)
  SKU Number:
  Family:

ID    SIZE TYPE
2     62   SMB_TYPE_BASEBOARD (base board)

  Manufacturer: Acer
  Product: WMCP78M
  Version:
  Serial Number: 0000000000000000000000
  Asset Tag:
  Location Tag:

  Chassis: 48
  Flags: 0x1
        SMB_BBFL_MOTHERBOARD (board is a motherboard)
  Board Type: 0xa (motherboard)

ID    SIZE TYPE
3     76   SMB_TYPE_CHASSIS (system enclosure or chassis)

  Manufacturer: Acer
  Version:
  Serial Number: 0000000000000000000000
  Asset Tag: 0000000000000000000000

  OEM Data: 0x0
  Lock Present: N
  Chassis Type: 0x3 (desktop)
  Boot-Up State: 0x2 (unknown)
  Power Supply State: 0x2 (unknown)
  Thermal State: 0x2 (unknown)
  Chassis Height: 0u
  Power Cords: 0
  Element Records: 0

ID    SIZE TYPE
4     101  SMB_TYPE_PROCESSOR (processor)

  Manufacturer: AMD
  Version: AMD Phenom(tm) 9550 Quad-Core Processor
  Serial Number:
  Asset Tag:
  Location Tag: Socket AM2
  Part Number:

  Family: 1 (other)
  CPUID: 0x178bfbff00100f23
  Type: 3 (central processor)
  Socket Upgrade: 4 (ZIF socket)
  Socket Status: Populated
  Processor Status: 1 (enabled)
  Supported Voltages: 1.2V
  External Clock Speed: Unknown
  Maximum Speed: 2200MHz
  Current Speed: 2200MHz
  L1 Cache: 8
  L2 Cache: 9
  L3 Cache: None

ID    SIZE TYPE
8     33   SMB_TYPE_CACHE (processor cache)

  Location Tag: Internal Cache

  Level: 1
  Maximum Installed Size: 131072 bytes
  Installed Size: 131072 bytes
  Speed: Unknown
  Supported SRAM Types: 0x20
        SMB_CAT_SYNC (synchronous)
  Current SRAM Type: 0x20 (synchronous)
  Error Correction Type: 2 (unknown)
  Logical Cache Type: 2 (unknown)
  Associativity: 2 (unknown)
  Mode: 1 (write-back)
  Location: 0 (internal)
  Flags: 0x1
        SMB_CAF_ENABLED (enabled at boot time)

ID    SIZE TYPE
9     33   SMB_TYPE_CACHE (processor cache)

  Location Tag: External Cache

  Level: 2
  Maximum Installed Size: 524288 bytes
  Installed Size: 524288 bytes
  Speed: Unknown
  Supported SRAM Types: 0x20
        SMB_CAT_SYNC (synchronous)
  Current SRAM Type: 0x20 (synchronous)
  Error Correction Type: 2 (unknown)
  Logical Cache Type: 2 (unknown)
  Associativity: 2 (unknown)
  Mode: 1 (write-back)
  Location: 0 (internal)
  Flags: 0x1
        SMB_CAF_ENABLED (enabled at boot time)
...

ID    SIZE TYPE
23    15   SMB_TYPE_MEMARRAY (physical memory array)

  Location: 3 (system board or motherboard)
  Use: 3 (system memory)
  ECC: 3 (none)
  Number of Slots/Sockets: 2
  Memory Error Data: Not Supported
  Max Capacity: 4294967296 bytes

ID    SIZE TYPE
24    86   SMB_TYPE_MEMDEVICE (memory device)

  Manufacturer: 2C00000000000000
  Serial Number: E40EE26D
  Asset Tag: None
  Location Tag: A0
  Part Number: 18HTF51272AY-667A

  Physical Memory Array: 23
  Memory Error Data: Not Supported

  Total Width: 64 bits
  Data Width: 64 bits
  Size: 4294967296 bytes
  Form Factor: 9 (DIMM)
  Set: None
  Memory Type: 19 (DDR2)
  Flags: 0x80
        SMB_MDF_SYNC (synchronous)
  Speed: Unknown
  Device Locator: A0
  Bank Locator: Bank0/1

ID    SIZE TYPE
25    86   SMB_TYPE_MEMDEVICE (memory device)

  Manufacturer: 2C00000000000000
  Serial Number: E40EE242
  Asset Tag: None
  Location Tag: A1
  Part Number: 18HTF51272AY-667A

  Physical Memory Array: 23
  Memory Error Data: Not Supported

  Total Width: 64 bits
  Data Width: 64 bits
  Size: 4294967296 bytes
  Form Factor: 9 (DIMM)
  Set: None
  Memory Type: 19 (DDR2)
  Flags: 0x80
        SMB_MDF_SYNC (synchronous)
  Speed: Unknown
  Device Locator: A1
  Bank Locator: Bank2/3

ID    SIZE TYPE
26    15   SMB_TYPE_MEMARRAYMAP (memory array mapped address)

  Physical Memory Array: 23
  Devices per Row: 1
  Physical Address: 0x0
  Size: 8589934592 bytes

ID    SIZE TYPE
27    19   SMB_TYPE_MEMDEVICEMAP (memory device mapped address)

  Memory Device: 24
  Memory Array Mapped Address: 26
  Physical Address: 0x0
  Size: 4294967296 bytes
  Partition Row Position: 1
  Interleave Position: 0
  Interleave Data Depth: 0

ID    SIZE TYPE
28    19   SMB_TYPE_MEMDEVICEMAP (memory device mapped address)

  Memory Device: 25
  Memory Array Mapped Address: 26
  Physical Address: 0x100000000
  Size: 4294967296 bytes
  Partition Row Position: 1
  Interleave Position: 0
  Interleave Data Depth: 0
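One detail worth pulling out of the SMB_TYPE_MEMDEVICE records in this thread is the pair "Total Width" and "Data Width": an earlier dump shows 72/64 bits, this one shows 64/64. The sketch below, assuming only the text layout seen in these dumps, computes the difference per device (the function name is hypothetical). A difference of 8 suggests the BIOS sees the extra check bits; a difference of 0 suggests it does not. Either way this is the BIOS's description and not proof of what the memory controller does.

```python
import re

# Matches a "Total Width" line followed by a "Data Width" line; \s+ also
# matches newlines, so flattened and line-broken output both work.
WIDTHS = re.compile(r"Total Width:\s*(\d+)\s*bits\s+Data Width:\s*(\d+)\s*bits")

# Values copied from two smbios dumps quoted in this thread.
DEVICE_72_64 = "  Total Width: 72 bits\n  Data Width: 64 bits\n"
DEVICES_64_64 = (
    "  Total Width: 64 bits\n  Data Width: 64 bits\n"
    "  Size: 4294967296 bytes\n"
    "  Total Width: 64 bits\n  Data Width: 64 bits\n"
)


def check_bits_per_device(smbios_text):
    """Return (total width - data width) for each memory device, in order."""
    return [int(total) - int(data) for total, data in WIDTHS.findall(smbios_text)]


if __name__ == "__main__":
    print(check_bits_per_device(DEVICE_72_64))
    print(check_bits_per_device(DEVICES_64_64))
```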
R.G. Keen
2010-Mar-09 14:52 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
Yay! Something where I can contribute! I am a hardware guy trying to live in a software world, but I think I know how this one works.

> The reason is that the vendor (ACER) of the mainboard
> says it is not supported, and I can not get into the
> bios any more, but osol boots fine and sees 8GB.
> Crucial says it's not supported because Acer says
> it's not supported... This is an MCP78S based
> motherboard (apparently equivalent Asus and Gigabyte
> boards are _supported_ platforms for this
> memory)...

The chipset may support ECC memory, and reply just fine to the OS and drivers that no errors have occurred, and the memory chips may check ECC and generate the "ECC error" signal to the chipset, but if the motherboard does not have a copper trace between the pin on the memory socket that connects to the ECC error pin on the memory DIMM and the pin on the chipset that receives the error signal, the chipset will never "hear" the memory complain about ECC errors whether they happen or not. The phone line is cut.

If the motherboard maker doesn't assure you it's connected by telling you that explicitly, or worse yet says it's not supported, chances are it's not supported. "Support" for a memory DIMM does not necessarily mean that the ECC works, only that the regular memory works.

I did not buy a Gigabyte board for the home server I'm laboriously (for a hardware guy in a software land) getting running, because although Gigabyte says they support the ECC memory DIMMs, they do not have any means for enabling/disabling ECC in the BIOS, and that tells me that they *tolerate* ECC DIMMs rather than *using* the ECC functions. ASUS, for the same chipset in my case, has a BIOS setting to enable/disable ECC reporting, so they have at least considered it.

I have the same issue coming up, because even if ASUS lets you turn reporting on and off, that's NOT a guarantee that the copper trace is there and all connected.

I read in this forum about a method for inducing ECC errors that involves holding a tungsten incandescent bulb near the DIMMs. It's worth a search. I will be doing that test when I get to the point where I have the thing running well enough for the test to be meaningful.

R.G.
Richard PALO
2010-Mar-10 07:03 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
Hi, thanks for the reply...

I guess I've gotten that far as well, but my question is targeted at understanding the real-world implication of the kernel software memory scrubber.

That is, in looking through the code a bit I notice that if hardware ECC is active the software scrubber is disabled. It is also disabled in the absence of ECC memory (or with unmatched ECC memory).

In my particular case:

bash-4.0# echo "memscrub_scans_done/U" | mdb -k
memscrub_scans_done:
memscrub_scans_done:            1985

It appears not to be disabled.

My question, I guess, put differently is: if it _is_ enabled, does it indeed do something useful in the sense of error detection?

That is, if it is enabled but *cannot* determine anything related to ECC, _why_ is it running in the first place? If ECC is crippled, then the software scrubber gives a false impression of doing something useful, which is perhaps a bug. On the other hand, if it *can* determine ECC (not crippled), then can we conclude that it is effective [enough] for the machine to run as a small and reasonably reliable server? That is, correct correctable errors and be able to log memory errors for eventual action...

cheers
R.G. Keen
2010-Mar-10 23:55 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
I did some reading on DDRn RAM and controller chips and how they do ECC. Sorry, but I was moderately incorrect. Here's closer to what happens.

DDRn memory has no ECC logic on the DIMMs. What it has is an additional eight bits of memory for each 64 bit read/write operation. That is, for ECC DIMMs, the reads and writes are 72 bits wide, not 64. The extra 8 bits are read/written just like any other bits.

The actual operation of error checking and correction happens in the memory controllers (for the ones I looked at, at least). These memory controller chipsets do the actual interaction with the DIMMs and (a) determine what, if any, bits get written to all 64 or 72 bits as well as (b) look at the data back from a read to see if what they get back is acceptable.

- if the memory controller chipset tolerates only 64 bit wide DIMMs but not 72 bit wide ones, it cannot do ECC.
- if the memory controller tolerates both 64 bit and 72 bit wide DIMMs, perhaps by ignoring the "extra" bits in a 64 wide read/write, then either style DIMM can be used, but if the memory controller doesn't compute, write, and then check the extra eight bits for errors, ECC never happens.
- if the controller computes the extra checking bits and sends them with a write, and also checks them on a read, it has the potential to do effective ECC in the controller itself, in hardware.
- for the couple of chipsets I looked at, if I read correctly, the controller is set up by the BIOS for doing or not doing ECC, and it may signal back to the software that an ECC event has happened.

I was incorrect - for DDRn, it's not a signalling line that says something is wrong.

Motherboards can force ECC not to happen by not carrying the extra bits to/from the DIMM sockets, in which case even if the memory controller supports ECC internally, it will not work. This is one method for tolerating either kind of DIMM, I guess. Another is to program the chipset in BIOS to not do ECC.

What I'm not clear on is what the OS does with this. I'm not competent to delve through the OS and find where the connection to the memory controller ECC enable/setup happens and what the ramifications are.

And I don't know what the link is between hardware ECC write/read in the memory and a software scrub. Is the nature of the scrub that it walks through memory doing read/write/read and looking at the ECC reply in hardware?

I came up with an all-software scrubbing technique, by doing a software block check much like zfs, but that seems very impractical.
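To make the "eight extra bits per 64" arithmetic concrete, here is a small sketch. It assumes a textbook Hamming code with one added overall parity bit (SECDED: single error correct, double error detect); real memory controllers use their own codes, so this illustrates the principle and does not describe any particular chipset. The first two functions show why 64 data bits need 8 check bits. The toy encoder/decoder works on 4 data bits only, to keep it readable.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """Hamming check bits plus one overall parity bit (double-error detection)."""
    return hamming_check_bits(data_bits) + 1


def encode(nibble):
    """Encode 4 data bits (list of 0/1) into an 8-bit SECDED codeword.

    Index 0 is the overall parity bit; indexes 1..7 are Hamming(7,4)
    positions, with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = nibble
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_error = sum(word) % 2     # 1 means an odd number of bits flipped
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        word[syndrome] ^= 1          # syndrome 0 points at the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips: detect only
    return status, [word[3], word[5], word[6], word[7]]


if __name__ == "__main__":
    print(secded_check_bits(64))     # check bits needed for a 64-bit word
    cw = encode([1, 0, 1, 1])
    cw[5] ^= 1                       # simulate a single flipped bit
    print(decode(cw))
```

With two flipped bits the decoder can only report the problem, which matches the "correct single, detect double" behaviour usually described for ECC DIMMs.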
Tonmaus
2010-Mar-11 13:43 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
> Is the nature of the scrub that it walks through
> memory doing read/write/read and looking at the ECC
> reply in hardware?

I think ZFS has no specific mechanisms in respect to RAM integrity. It will just count on a healthy and robust foundation for any component in the machine. As far as I understand, it's just a good idea to have ECC RAM once you talk about a certain amount of data that will inevitably go through a certain path. Servers controlling PB of data are certainly a case for ECC memory in my view.

-Tonmaus
R.G. Keen
2010-Mar-11 15:49 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
> I think ZFS has no specific mechanisms in respect to
> RAM integrity. It will just count on a healthy and
> robust foundation for any component in the machine.

I'd really like to understand what the OS does with respect to ECC. Anyone who does understand the internal operation and can comment would be doing me a real favor by 'splaining this to me. 8-)

And yes, it's the OS, not zfs, that would do the memory operations.

- I don't think there is a software mechanism for detecting and/or correcting memory errors. I'll go read up on memtest, but I suspect it is just that: a memory testing routine that writes to memory, reads it back, and then tries to discover whether what it read back is what it sent. This is a good way to discover hard, stuck faults in a memory array, but cannot cope well with soft and intermittent errors.
- ECC is great for dealing with soft, intermittent errors, because it completely prevents single, infrequent errors from causing "bit rot" by polluting memory which is then flushed back to disk (and then protected from rot on disk by zfs).
- ECC can hide a rising soft error rate from a failing memory. This is good in that it holds off the day when things crash, but bad unless the errors are bubbled up so the user can see them: the data is in there to do preventive maintenance and replace the failing unit. It's bad if it hides errors from a memory testing routine, as has been noted in this thread.
- You need to turn off hardware/chipset ECC to get a real result from a software write/read back memory test. Otherwise all you get back is 'yep, everything's all right'.

I think I need to get into the OS forum to understand this better.
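The write/read-back idea described in the first bullet can be sketched in a few lines. This is a toy model only: it runs against a simulated byte array with one stuck bit (the class and function names are invented for the example), and it is not how memtest itself is implemented. It does show why such a test finds hard, stuck faults, and why it would report nothing if something underneath silently corrected the bad bit on every read.

```python
class StuckBitMemory:
    """Toy memory where one bit of one cell always reads as 1 (a hard fault)."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.bad_addr:
            value |= self.bad_mask   # the stuck bit overrides what was written
        return value


def pattern_test(mem, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read back, return failing addresses."""
    bad = set()
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            if mem.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


if __name__ == "__main__":
    faulty = StuckBitMemory(size=64, bad_addr=10, bad_bit=3)
    print(pattern_test(faulty, 64))
```

Several patterns are used because a bit stuck at 1 is invisible while the pattern being written already has a 1 in that position.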
Tonmaus
2010-Mar-11 16:15 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
> I'd really like to understand what OS does with
> respect to ECC.

In information technology, ECC (Error Correction Code; the Wikipedia article is worth reading) normally protects point-to-point "channels". Hence, this is entirely a "hardware" thing here.

Regards,
Tonmaus
Robert Milkowski
2010-Mar-11 16:30 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
On 11/03/2010 15:49, R.G. Keen wrote:

>> I think ZFS has no specific mechanisms in respect to
>> RAM integrity. It will just count on a healthy and
>> robust foundation for any component in the machine.
>
> I'd really like to understand what OS does with respect to ECC. Anyone who does understand the internal operation and can comment would be doing me a real favor by 'splaining this to me. 8-)
>
> And yes, it's the OS, not zfs, that would do the memory operations.
>
> - I don't think there is a software mechanism for detecting and/or correcting memory errors. I'll go read up on memtest, but I suspect it is just that - a memory testing routine that writes to memory, reads it back, and then tries to discover whether what it read back is what it sent. This is a good way to discover hard, stuck faults in a memory array, but cannot cope well with soft and intermittent errors.
> - ECC is great for dealing with soft, intermittent errors, because it completely prevents single, infrequent errors from causing "bit rot" by polluting memory which is then flushed back to disk (and then protected from rot in disk by zfs.)
> - ECC can hide a rising soft error rate from a failing memory. This is good in that it holds off the day when things crash, but bad in that the data is in there to do preventive maintenance to replace the failing unit if it's bubbled up so the user can see it. It's bad if it hides errors from a memory testing routine, as has been noted in this thread.
> - You need to turn off hardware/chipset ECC to get a real result from a software write/read back memory test. Otherwise all you get back is 'yep, everything's all right'.
>
> I think I need to get into the OS forum to understand this better.

Solaris *can* detect ECC errors (correctable or not) and they will be fed into FMA. Then FMA will take appropriate actions: for example, if there are more than N correctable errors in a given memory page within a 24h window, FMA will migrate the data in that page somewhere else and mark the page dead. You will usually lose 4kB or 8kB of memory, but at least you are minimizing risk.

If it was an uncorrectable error then it depends on what was referring to the page: if only a userland application, then it will get killed (and restarted by SMF or the cluster); if it was referred to by the kernel, then the entire OS will panic.

For more information look at:

http://blogs.sun.com/mws/entry/fma_on_x64_and_at
http://milek.blogspot.com/2006/05/psh-smf-less-downtime.html

--
Robert Milkowski
http://milek.blogspot.com
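The "more than N correctable errors in a page within a 24h window" rule can be pictured with a small sketch. Everything here is hypothetical: the class name, the threshold value and the bookkeeping are invented for illustration and are not taken from the FMA sources. It shows only the shape of a sliding-window policy, under the assumption that errors arrive as (page, timestamp) events.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60   # the 24h window mentioned in the message


class PageRetirementSketch:
    """Toy sliding-window policy: retire a page after too many correctable errors."""

    def __init__(self, threshold):
        self.threshold = threshold         # the "N" above; real value not given here
        self.events = defaultdict(deque)   # page -> timestamps of recent errors
        self.retired = set()

    def correctable_error(self, page, now):
        """Record one correctable error; return True if the page is retired."""
        if page in self.retired:
            return True
        stamps = self.events[page]
        stamps.append(now)
        # Forget errors that fell out of the 24h window.
        while stamps and now - stamps[0] > WINDOW_SECONDS:
            stamps.popleft()
        if len(stamps) > self.threshold:
            self.retired.add(page)         # stand-in for "migrate data, mark page dead"
            del self.events[page]
        return page in self.retired


if __name__ == "__main__":
    policy = PageRetirementSketch(threshold=2)
    for t in (0, 10, 20):
        print(t, policy.correctable_error(0x1000, now=t))
```

Errors spread further apart than the window never accumulate, which is the point of the rule: an isolated soft error is tolerated, a repeating one in the same page is treated as a failing part.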
Christo Kutrovsky
2010-Mar-11 17:20 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
Robert,

That's great info. Do you know how you can check the number of CORRECTED errors by ECC in OpenSolaris?
Richard Elling
2010-Mar-11 23:40 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
On Mar 11, 2010, at 7:49 AM, R.G. Keen wrote:

>> I think ZFS has no specific mechanisms in respect to
>> RAM integrity. It will just count on a healthy and
>> robust foundation for any component in the machine.
>
> I'd really like to understand what OS does with respect to ECC. Anyone who does understand the internal operation and can comment would be doing me a real favor by 'splaining this to me. 8-)

There are multiple levels of ECC, error reporting, and scrubs at work. The exact ones depend largely on the hardware and how it handles ECC. The M9000-class machines, for instance, have sophisticated memory scrubbing built into the memory controllers and include options for memory mirroring. Some Xeon models support memory mirroring and some PC vendors even claimed to have hot swappable DIMMs, mirrored of course. I don't have the intestinal fortitude to hot swap a DIMM on a PC, so I'll leave that to the glossies. Some processors just go bonkers and abort when they see an uncorrectable ECC error... not much Solaris can do when it isn't running.

So there are hardware scrubbers in many modern servers and software scrubbers in Solaris. For an interesting read on how Solaris handles memory faults, see the pointers at
http://blogs.sun.com/relling/entry/analysis_of_memory_page_retirement

On Mar 11, 2010, at 9:20 AM, Christo Kutrovsky wrote:

> Do you know how you can check the number of CORRECTED errors by ECC in OpenSolaris?

FMA logs the errors as seen by Solaris or as reported by hardware that notifies Solaris. For some systems, these are also logged to the system controller.

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Alex Krasnov
2010-Jul-12 21:07 UTC
[zfs-discuss] How to verify ecc for ram is active and enabled?
> From this output it appears as if Solaris, via the
> BIOS I presume, it looks like my BIOS thinks it
> doesn't have ECC RAM, even though all the memory
> modules are indeed ECC modules.
>
> Might be time to check (1) my current BIOS settings,
> even though I felt sure ECC was enabled in the BIOS
> already, and (2) check for a newer BIOS update. A
> pity, as the machine has been rock-solid so far, and
> I don't like changing stable BIOSes...

My apologies for resurrecting this thread, but I am curious whether you have had any success enabling ECC on your M2N-SLI machine, using either the BIOS or the setpci scripts. I am experiencing a similar issue with my M2N32-SLI machine. The BIOS reports that ECC is turned on, but smbios reports that it is turned off:

ID    SIZE TYPE
0     106  SMB_TYPE_BIOS (BIOS information)

  Vendor: Phoenix Technologies, LTD
  Version String: ASUS M2N32-SLI DELUXE ACPI BIOS Revision 2001
  Release Date: 05/19/2008
  Address Segment: 0xe000
  ROM Size: 1048576 bytes
  Image Size: 131072 bytes
  Characteristics: 0x7fcb9e80

ID    SIZE TYPE
63    15   SMB_TYPE_MEMARRAY (physical memory array)

  Location: 3 (system board or motherboard)
  Use: 3 (system memory)
  ECC: 3 (none)
  Number of Slots/Sockets: 4
  Memory Error Data: Not Supported
  Max Capacity: 17179869184 bytes