Completely fresh, virgin install of b115 SXCE, zero modifications done. If I select xVM from the boot menu, it comes all the way up to "Starting Desktop login on Display:0..." and reboots. Regular Solaris (not xVM) boots fine.

Intel Xeon CPU, 8 GB RAM
Intel S3210SHLX motherboard (http://www.intel.com/Products/Server/Motherboards/Entry-S3200SH/Entry-S3200SH-overview.htm)

Please help, and thanks a lot in advance.
-- This message posted from opensolaris.org
2009/6/22 Alex Gor <no-reply@opensolaris.org>:
> Completely fresh,virgin install of b115 SXCE ,zero modifications done.
> If select XVM from boot menu,it comes all the way till
> "Starting Desktop login on Display:0..." and reboots.

After it has crashed and you boot into non-xVM Solaris, does it save a crash dump to /var/crash/`hostname`?
Alex Gor wrote:
> Completely fresh,virgin install of b115 SXCE ,zero modifications done.
> If select XVM from boot menu,it comes all the way till
> "Starting Desktop login on Display:0..." and reboots.
> Regular Solaris(not XVM) boots fine.
>
> Intel XEON CPU,8G RAM
> Intel S3210SHLX motherboard(http://www.intel.com/Products/Server/Motherboards/Entry-S3200SH/Entry-S3200SH-overview.htm)
>
> Please help and thanks a lot in advance.

I know you said zero modifications done. Just to be 101% sure :-) Did you install the downloadable nVidia drivers? This is what will happen if you don't use the nVidia graphics driver which is distributed on the Solaris iso.

Assuming you didn't, what graphics chipset are you using? I.e., do you have a PCI-E graphics card installed?

MRJ
Yes, the files are there (/var/crash/`hostname`):

bounds  unix.1  unix.3  unix.5  vmcore.1  vmcore.3  vmcore.5
unix.0  unix.2  unix.4  vmcore.0  vmcore.2  vmcore.4

Which one should I take a look at?

Thanks.
OK. This is the situation:

By default it has its own built-in video card, which is identified in xorg.conf as

Identifier "Card0"
Driver "mga"
VendorName "Matrox Graphics, Inc."
BoardName "MGA G200e [Pilot] ServerEngines (SEP1)"
BusID "PCI:3:0:0"

I tried changing the Driver from "mga" to "vesa", but it didn't make any difference.

Then I installed a PCI-E ATI Radeon video card

Identifier "Card0"
Driver "radeonhd"
VendorName "ATI Technologies Inc"
BoardName "RV516 [Radeon X1300/X1550 Series]"
BusID "PCI:1:0:0"

(because I thought the rebooting issue was caused by a problematic video card), but it didn't make any difference either... still rebooting.
So I came to the conclusion that the issue has nothing to do with the video card. Maybe I'm wrong.

Thanks for all your help.
2009/6/22 Alex Gor <no-reply@opensolaris.org>:
> Yes,files are there(/var/crash/`hostname`)
>
> bounds unix.1 unix.3 unix.5 vmcore.1 vmcore.3 vmcore.5
> unix.0 unix.2 unix.4 vmcore.0 vmcore.2 vmcore.4
>
> Which one should I take a look at ?

Check the timestamps with "ls -l"; use one set (unix.N + vmcore.N) that was created after such a crash.

Run "mdb -k N" (where N is the unix.N + vmcore.N file name suffix), e.g. "mdb -k 5".

In mdb:

::status
$<msgbuf
$C
::cpuinfo -v
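The "check the timestamps" step can also be scripted. The following is only a sketch, assuming a Python 3 interpreter is available wherever the dump files are stored; the function name newest_dump_suffix is made up for this example, and it does nothing more than compare the modification times that "ls -l" would show.

```python
import re
from pathlib import Path


def newest_dump_suffix(crash_dir):
    """Return the suffix N of the most recently written unix.N/vmcore.N pair.

    Returns None if the directory holds no complete pair.
    (Hypothetical helper, not part of any Solaris tool.)
    """
    crash_dir = Path(crash_dir)
    best = None  # (mtime, suffix) of the newest complete pair seen so far
    for vmcore in crash_dir.glob("vmcore.*"):
        match = re.fullmatch(r"vmcore\.(\d+)", vmcore.name)
        if match is None:
            continue
        suffix = int(match.group(1))
        # Only a complete pair is usable with "mdb -k N".
        if not (crash_dir / f"unix.{suffix}").exists():
            continue
        mtime = vmcore.stat().st_mtime
        if best is None or mtime > best[0]:
            best = (mtime, suffix)
    return None if best is None else best[1]
```

The returned number is the N to pass to "mdb -k N". Whether the newest dump is really the one from the xVM crash still has to be checked against the time of the reboot.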
Alex Gor wrote:
> OK.This is a situation:
>
> By default it has it's own buit-in video card which is identified under xorg.conf as
>
> Identifier "Card0"
> Driver "mga"
> VendorName "Matrox Graphics, Inc."
> BoardName "MGA G200e [Pilot] ServerEngines (SEP1)"
> BusID "PCI:3:0:0"
>
> I tried to replace Driver from "mga" to "vesa" but didn't make any difference.
>
> Then I installed pci-e ATI RAdeon video card
>
> Identifier "Card0"
> Driver "radeonhd"
> VendorName "ATI Technologies Inc"
> BoardName "RV516 [Radeon X1300/X1550 Series]"
> BusID "PCI:1:0:0"
>
> (because I thought the rebooting issue is because of problematic video card) but it didn't make any difference either...still rebooting.
> So I came to conclusion the issue has nothing to do with video card.May be I'm wrong.
>
> Thanks for all your help.

Does it panic if you disable X? E.g., under metal:

svcadm disable cde-login
svcadm disable gdm
(reboot into xVM)

MRJ
OK, I think we're getting there.
No, it doesn't panic. It gives me the following message at the login prompt (but by pressing Enter I can see the login prompt and am able to log in normally):

svc.strtd[7] system/xvm/virtd:default failed: transitioned to maintenance see 'svcs -xv' for details

svcs -xv shows the following:

svc:/system/xvm/virtd:default (libvirt management daemon)
 State: maintenance since Mon Jun 22 20:30:08 2009
Reason: Start method failed repeatedly, last exited with status 137.
   See: http://sun.com/msg/SMF-8000-KS
   See: man -M /usr/share/man -s 1M libvirtd
   See: /var/svc/log/system-xvm-virtd:default.log
Impact: This service is not running.

The file /var/svc/log/system-xvm-virtd:default.log is attached.

Thanks a lot.
Full output is attached. Thank you very much.
Alex Gor wrote:
> O.K. I think we're getting there.
> No, it doesn't panic. It gives me the following message upon login prompt (but by clicking Enter I can see login prompt and able to login normally)
>
> svc.strtd[7] system/xvm/virtd:default failed: transitioned to maintenance see 'svcs -xv' for details
>
> svcs -xv shows the following:
>
> svc:/system/xvm/virtd:default (libvirt management daemon)
> State: maintenance since Mon Jun 22 20:30:08 2009
> Reason: Start method failed repeatedly, last exited with status 137.
> See: http://sun.com/msg/SMF-8000-KS
> See: man -M /usr/share/man -s 1M libvirtd
> See: /var/svc/log/system-xvm-virtd:default.log
> Impact: This service is not running.
>
> File /var/svc/log/system-xvm-virtd:default.log is attached

Sigh, yeah, this one (virtd) is a known problem with b115. You should be able to create a symbolic link for libgnutls.so.13, something like

ln -s /usr/lib/libgnutls.so.26 /usr/lib/libgnutls.so.13

This is fixed in b116.

MRJ
Alex Gor wrote:
> Full output is attached.
>
> Thanks you very much.

Oh, interesting....

ffffff00100165c0 unix:die+10f ()
ffffff00100166d0 unix:trap+1775 ()
ffffff00100166e0 unix:cmntrap+12f ()
ffffff0010016870 unix:atomic_cas_ulong+3 ()
ffffff0010016910 unix:hati_pte_map+123 ()
ffffff0010016990 unix:hati_load_common+15d ()
ffffff0010016a50 unix:hat_devload+198 ()
ffffff0010016aa0 npe:pcitool_map+de ()
ffffff0010016b10 npe:pcitool_pciex_cfg_access+e3 ()
ffffff0010016bf0 npe:pcitool_dev_reg_ops+185 ()
ffffff0010016c70 npe:pci_common_ioctl+ce ()
ffffff0010016d00 npe:npe_ioctl+77 ()
ffffff0010016d40 genunix:cdev_ioctl+45 ()
ffffff0010016d80 specfs:spec_ioctl+83 ()
ffffff0010016e00 genunix:fop_ioctl+7b ()
ffffff0010016f00 genunix:ioctl+18e ()
ffffff0010016f10 unix:brand_sys_syscall+261 ()

I wonder if X has switched from xsvc to pcitool?

It doesn't want to map the physical address fec00000.
pcitool_map+0xde(fec00000

I have a different code path in xsvc to handle memory vs IO. I'll have to look at that again...

Can you send out the output of ::ptree too?

bash-3.2# mdb -k 0
::ptree

Thanks,

MRJ
I created the symlink, but it doesn't make any difference. Exact same message that a file is missing. But the file is definitely there:

-bash-3.2# ls -lh /usr/lib/libgnutls.so.13
lrwxrwxrwx 1 root root 24 Jun 23 00:02 /usr/lib/libgnutls.so.13 -> /usr/lib/libgnutls.so.26
-bash-3.2# ln -s /usr/lib/libgnutls.so.26 /usr/lib/libgnutls.so.13
ln: cannot create /usr/lib/libgnutls.so.13: File exists

Not sure what I'm doing wrong... I will install b116 and we'll see from there.

Thanks a lot.
Sure. Attached. Thanks.
Made a clean installation of b116. Exactly the same issue.

Please see the attached crash output of

::status
$<msgbuf
$C
::cpuinfo -v
::ptree

If I disable gdm and cde-login, the system boots to the command line with no errors. svcs -xv returns nothing. So the problem of virtd not finding libgnutls.so.13 is gone. But I guess it has nothing to do with my original issue.

Please help. Thanks a lot.
Mark wrote:
> Alex Gor wrote:
> > Full output is attached.
> >
> > Thanks you very much.
>
> Oh, interesting....
>
> ffffff00100165c0 unix:die+10f ()
> ffffff00100166d0 unix:trap+1775 ()
> ffffff00100166e0 unix:cmntrap+12f ()
> ffffff0010016870 unix:atomic_cas_ulong+3 ()
> ffffff0010016910 unix:hati_pte_map+123 ()
> ffffff0010016990 unix:hati_load_common+15d ()
> ffffff0010016a50 unix:hat_devload+198 ()
> ffffff0010016aa0 npe:pcitool_map+de ()
> ffffff0010016b10 npe:pcitool_pciex_cfg_access+e3 ()
> ffffff0010016bf0 npe:pcitool_dev_reg_ops+185 ()
> ffffff0010016c70 npe:pci_common_ioctl+ce ()
> ffffff0010016d00 npe:npe_ioctl+77 ()
> ffffff0010016d40 genunix:cdev_ioctl+45 ()
> ffffff0010016d80 specfs:spec_ioctl+83 ()
> ffffff0010016e00 genunix:fop_ioctl+7b ()
> ffffff0010016f00 genunix:ioctl+18e ()
> ffffff0010016f10 unix:brand_sys_syscall+261 ()
>
> I wonder if X has switch from xsvc to pcitool?

AFAIK, yes. I think Xorg 1.5 is using libpciaccess[1], and that library uses pcitool. The scanpci utility uses the same libpciaccess, and there have been quite a few problems [2] [3] [4] with PCIe, memory mapped config space and the switch to libpciaccess.

IIRC, a workaround was added to Xorg 1.5 to limit the maximum pci bus that is scanned when searching for video hardware, to work around the problem with the kernel's pcitool code accessing pcie configuration space for invalid pci busses. I think it was easier to add a workaround to Xorg than to get the kernel fix for pcitool in [5].

6799812 was fixed some months ago, and I think the next Xorg release, 1.6, was integrated into b116. Maybe the workarounds for pcitool that had been added to Xorg 1.5 were removed from Xorg 1.6, and it is now scanning all possible pci busses?
[1] http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6752913
[2] http://defect.opensolaris.org/bz/show_bug.cgi?id=6100
[3] CR 6789879
[4] CR 6799812
[5] http://opensolaris.org/jive/thread.jspa?messageID=334459#334459

> It doesn't want to map the physical address fec00000.
> pcitool_map+0xde(fec00000
> have a different code path in xsvc to handle memory
> vs IO. I'll have to look at that again...

Could there be a bios bug?

Alex, can you try to dump the bios acpi tables for that machine, using the Intel ACPI table utility "iasl"? Attachment 4 of bug 5693

http://defect.opensolaris.org/bz/show_bug.cgi?id=5693#c9

contains a copy of the iasl utility. One of the files that gets dumped should have a name like "MCFG*.dat"; can you post a hexdump of that file?

od -x MCFG*.dat

The output from prtconf -pv could also be interesting, especially all properties on the root pci node that have property names starting with "ecfg".
Attached are the outputs of 'prtconf -pv' and 'od -x MCFG_S3200SHX.dat'. Thanks.
2009/6/23 Alex Gor <no-reply@opensolaris.org>:
> Attached outputs of 'prtconf -pv' and od -x MCFG_S3200SHX.dat

Sorry, the -p option wasn't the right one for prtconf; "prtconf -v" should print the "ecfg" property (e.g. prtconf -v /devices/pci@0,0).

But I think the dump of the MCFG data reveals the problem:

CFG_BASE_ADDR_ALLOC.base_addr: f0000000
CFG_BASE_ADDR_ALLOC.segment: 0
CFG_BASE_ADDR_ALLOC.start_bno: 0
CFG_BASE_ADDR_ALLOC.end_bno: ff

Memory mapped pcie configuration space begins at physical address 0xf0000000 and can be used for pci bus 0 .. 255. At 1 MB of configuration space per bus, that uses all physical addresses from 0xf0000000 ... 0xffffffff. But that can't work; it overlaps the IOAPIC at physical address 0xfec00000.

I suspect it panics when Xorg is probing pci bus 0xec for video devices; this ends up accessing the physical address of the IOAPIC, and I suspect the hypervisor refuses this.

I suspect that another way to crash this system is to boot into xVM single user mode and run

/usr/X11/bin/scanpci

A bios update might help. I think the CFG_BASE_ADDR_ALLOC.end_bno value in the MCFG table must be reduced so that there is no overlap with physical address ranges that are already in use.

OTOH, the function npe_query_acpi_mcfg() in usr/src/uts/i86pc/io/pciex/npe_misc.c could be changed to sanity check for and detect this kind of error.
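The arithmetic behind that diagnosis can be sketched in a few lines of Python. This is only an illustration, assuming the usual PCIe memory-mapped layout of 1 MiB of configuration space per bus; the function names are made up for this example (they are not taken from npe_misc.c), and the constants are the values quoted in this thread.

```python
BUS_SHIFT = 20            # each pci bus gets 1 MiB of memory mapped config space
IOAPIC_ADDR = 0xFEC00000  # physical address seen in pcitool_map() in the panic stack


def ecam_window(base, start_bus, end_bus):
    """Return (first, last) physical address claimed by one MCFG allocation."""
    first = base + (start_bus << BUS_SHIFT)
    last = base + ((end_bus + 1) << BUS_SHIFT) - 1
    return first, last


def bus_for_address(base, addr):
    """Return the bus number whose config space contains addr."""
    return (addr - base) >> BUS_SHIFT


def window_contains(base, start_bus, end_bus, addr):
    """True if addr falls inside the memory mapped config window."""
    first, last = ecam_window(base, start_bus, end_bus)
    return first <= addr <= last


if __name__ == "__main__":
    first, last = ecam_window(0xF0000000, 0x00, 0xFF)
    print(hex(first), hex(last))                          # 0xf0000000 0xffffffff
    print(hex(bus_for_address(0xF0000000, IOAPIC_ADDR)))  # 0xec
    print(window_contains(0xF0000000, 0x00, 0xFF, IOAPIC_ADDR))  # True
```

A sanity check of the kind suggested for npe_query_acpi_mcfg() would presumably make the same comparison against every address range already in use, not just the IOAPIC; that part is not shown here.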
Bios update fixed the issue. Thank you very much, guys, for all your help.
2009/6/24 Alex Gor <no-reply@opensolaris.org>:
> Bios update fixed the issue.

For completeness: how did the MCFG acpi table change after the bios upgrade?
Below:

0000000 434d 4746 003c 0000 8201 4e49 4554 204c
0000020 3353 3032 5330 5848 0000 0000 534d 5446
0000040 0013 0100 0000 0000 0000 0000 0000 f000
0000060 0000 0000 0000 3f00 0000 0000
0000074
2009/6/25 Alex Gor <no-reply@opensolaris.org>:
> below
>
> 0000000 434d 4746 003c 0000 8201 4e49 4554 204c
> 0000020 3353 3032 5330 5848 0000 0000 534d 5446
> 0000040 0013 0100 0000 0000 0000 0000 0000 f000
> 0000060 0000 0000 0000 3f00 0000 0000
> 0000074

That looks ok; they've reduced the highest pci bus number with memory mapped configuration space access from 255 to 63 (64 busses instead of 256), and the physical base address remains unchanged at 0xf0000000. The window now ends at 0xf3ffffff, well below the IOAPIC at 0xfec00000.
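For anyone who wants to check such a dump without decoding the words by hand, here is a rough Python sketch. It assumes the standard ACPI layout (36-byte table header, 8 reserved bytes, then 16-byte allocation entries) and that "od -x" was run on a little-endian x86 machine, so each 4-digit group is a byte-swapped 16-bit word. The helper names are invented for this example, and only the first allocation entry is decoded.

```python
import struct

# The "od -x" output posted after the bios update.
OD_DUMP = """\
0000000 434d 4746 003c 0000 8201 4e49 4554 204c
0000020 3353 3032 5330 5848 0000 0000 534d 5446
0000040 0013 0100 0000 0000 0000 0000 0000 f000
0000060 0000 0000 0000 3f00 0000 0000
0000074
"""


def od_x_to_bytes(text):
    """Turn "od -x" output (little-endian 16-bit words) back into raw bytes."""
    raw = bytearray()
    for line in text.splitlines():
        fields = line.split()
        for word in fields[1:]:  # fields[0] is the octal file offset
            raw += struct.pack("<H", int(word, 16))
    return bytes(raw)


def parse_mcfg(raw):
    """Decode the header fields and the first allocation entry of an MCFG table."""
    signature, length = struct.unpack_from("<4sI", raw, 0)
    oem_id, oem_table_id = struct.unpack_from("<6s8s", raw, 10)
    # Offset 44 = 36-byte ACPI header + 8 reserved bytes.
    base, segment, start_bus, end_bus = struct.unpack_from("<QHBB", raw, 44)
    return {
        "signature": signature.decode("ascii"),
        "length": length,
        "oem_id": oem_id.decode("ascii"),
        "oem_table_id": oem_table_id.decode("ascii"),
        "base": base,
        "segment": segment,
        "start_bus": start_bus,
        "end_bus": end_bus,
    }


if __name__ == "__main__":
    info = parse_mcfg(od_x_to_bytes(OD_DUMP))
    print(info["signature"], info["oem_table_id"])
    print(hex(info["base"]), info["start_bus"], hex(info["end_bus"]))
```

Run against the dump above, this should report a base address of 0xf0000000 with busses 0 through 0x3f, which matches the reading given in the reply.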