On 1/17/2018 3:39 PM, Don Lewis wrote:
> On 17 Jan, Mike Tancsa wrote:
>> On 1/17/2018 8:43 AM, Pete French wrote:
>>>
>>> Are you running the latest STABLE? There were some patches for Ryzen
>>> which went in, I believe, and might affect the stability. Specifically the
>>> changes to stop it locking up when executing code in the top page?
>>
>> Hi,
>> I was testing with RELENG_11 as of 2 days ago. The fix seems to be there
>>
>> # sysctl -A hw.lower_amd64_sharedpage
>> hw.lower_amd64_sharedpage: 1
>>
>> Would love to find a class of motherboard that pushes its "You don't need
>> to dork around with any BIOS settings. It just works. Oh, and we have a
>> hardware watchdog too".... ipmi would be stellar.
>
> The shared page change fixed the random lockup and silent reboot problem
> for me. I've got a 1700X eight core CPU and a Gigabyte X370 Gaming 5. I
> did have to RMA my CPU (it was an early one) because it had the problem
> with random segfaults that seemed to be triggered by process migration
> between CPU cores. I still haven't switched over to using it for
> package builds because I see more random fallout than on my older
> package builder. I'm not blaming the hardware for that at this point
> because I see a lot of the same issues on my older machine, but less
> frequently.
>
> One thing to watch (though it should be less critical with a six core
> CPU) is VRM cooling. I removed the stupid plastic shroud over the VRM
> sink on my motherboard so that it gets some more airflow.

Thanks! I will confirm the cooling. I tried just now looking at the CPU
FAN control in the BIOS and up'd it to "turbo" from the default. Does
amdtemp.ko work with your chipset? Nothing on mine unfortunately, so I
can't tell from the OS if it's running hot.

Is there a way to see if your CPU is old and has that bug? I haven't
seen any segfaults on the few dozen buildworlds I have done. So far it's
always been a total lockup and not a crash with RELENG_11.
x86info v1.31pre
Found 12 identical CPUs
Extended Family: 8 Extended Model: 0 Family: 15 Model: 1 Stepping: 1
CPU Model (x86info's best guess): AMD Zen Series Processor (ZP-B1)
Processor name string (BIOS programmed): AMD Ryzen 5 1600 Six-Core Processor

Monitor/Mwait: min/max line size 64/64, ecx bit 0 support, enumeration extension
SVM: revision 1, 32768 ASIDs, np, lbrVirt, SVMLock, NRIPSave, TscRateMsr, VmcbClean, FlushByAsid, DecodeAssists, PauseFilter, PauseFilterThreshold
Address Size: 48 bits virtual, 48 bits physical
The physical package has 12 of 16 possible cores implemented.
running at an estimated 3.20GHz

	---Mike

-- 
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike at sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/
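For anyone reading the x86info output above: the "Extended Family" and "Family" fields are usually combined into a single display family. The following is a minimal Python sketch of that arithmetic, assuming the usual CPUID convention as AMD documents it (extended fields are only applied when the base family is 15). The function name `display_family_model` is made up for illustration and is not part of x86info.

```python
def display_family_model(family, model, ext_family, ext_model):
    """Combine base and extended CPUID fields into display family/model.

    Assumption: AMD's convention, where the extended fields are
    applied only when the base family is 0xF (15).
    """
    if family == 0xF:
        display_family = family + ext_family
        display_model = (ext_model << 4) | model
    else:
        display_family = family
        display_model = model
    return display_family, display_model


# Values taken from the x86info output quoted in this message.
fam, mod = display_family_model(family=15, model=1, ext_family=8, ext_model=0)
print(f"family {fam:#x} ({fam}), model {mod:#x}")
```

With the values reported here this gives family 23 (17h), model 1, which matches x86info's "Zen" guess. It does not say anything about whether a particular part has the segfault problem.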
I'm running 11-STABLE from 12/9. amdtemp works for me. It also has the
sysctl indicating that it has the shared page fix. I'm pretty sure I've
seen the lockups since then. I'll update to the latest STABLE and see
what happens.

One weird thing about my experience is that if I keep something running
continuously, like the distributed.net client on 6 of 12 possible threads,
it keeps the system up for MUCH longer than without. This is a home server
and very lightly loaded (one could argue insanely overpowered for the use
case).

I'm glad to see that there has been some attention on this. I was a
little disappointed by the earlier thread. I'm happy to help troubleshoot,
but I'm not sure what information I can gather from a hard locked system
that doesn't even show anything on the console.

-- 
Nimrod
On 17 Jan, Mike Tancsa wrote:
> Is there a way to see if your CPU is old and has that bug? I haven't
> seen any segfaults on the few dozen buildworlds I have done. So far it's
> always been a total lockup and not a crash with RELENG_11.
>
> x86info v1.31pre
> Found 12 identical CPUs
> Extended Family: 8 Extended Model: 0 Family: 15 Model: 1 Stepping: 1
> CPU Model (x86info's best guess): AMD Zen Series Processor (ZP-B1)
> Processor name string (BIOS programmed): AMD Ryzen 5 1600 Six-Core
> Processor

My original CPU had a date code of 1708SUT (8th week of 2017, I think),
and the replacement has a date code of 1733SUS. There's a humongous
discussion thread at <https://community.amd.com/thread/215773> where date
codes are discussed. As I recall, the first replacement parts shipped had
date codes somewhere in the mid 20s, but I think AMD was still hand
screening parts at that point. My replacement came in a sealed box, so it
wasn't hand screened, and AMD was probably able to screen for this problem
in their production test.
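To make the date code reading above concrete, here is a rough Python sketch that splits a code such as 1708SUT into year and week. It assumes the layout guessed at in this message (two-digit year, two-digit week, then a letter suffix whose meaning is not known here). The function name `parse_date_code` is hypothetical, and nothing in it decides whether a given part is affected.

```python
import re


def parse_date_code(code):
    """Split a date code like '1708SUT' into (year, week, suffix).

    Assumption: the first two digits are the year (20YY) and the
    next two are the week of that year, as guessed in this thread.
    """
    match = re.fullmatch(r"(\d{2})(\d{2})([A-Z]+)", code)
    if match is None:
        raise ValueError(f"unrecognized date code: {code!r}")
    year = 2000 + int(match.group(1))
    week = int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in date code: {code!r}")
    return year, week, match.group(3)


# The two date codes mentioned in this message.
for code in ("1708SUT", "1733SUS"):
    year, week, suffix = parse_date_code(code)
    print(f"{code}: week {week} of {year} (suffix {suffix})")
```

The code is printed on the CPU's heat spreader, so this only helps once the cooler is off or if the number was noted at install time.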