Hello,

I've got 9 domUs. Each is a RHEL 5.2 instance with 1G ram, 1 cpu, and a 100G drive, paravirtualized. The drives are created like this:

  pfexec zfs create -s -V 100G datastore/virtMachine1

The hardware is a Dell 2900, 48G ram, 3.06T of 15k rpm sas drives. OpenSolaris seems to be fairly happy on this system.

When I ran everything as zones it was all fine and fast, but the vendor requires RHEL, and I refuse to give up ZFS, so I had to fire up xVM just so I could run MySQL inside an x86 container called RHEL 5.2.

Anyway, these domUs boot, run, and work pretty well (slower than zones by about 17%, btw), and generally work fine. Except that they crash pretty regularly, anywhere between 6 and 10 days apart. I've been searching forums, etc. Not sure what to do. Here's a log entry:

Jul 1 15:15:08 ecw-mysql1 unix: [ID 836849 kern.notice]
Jul 1 15:15:08 ecw-mysql1 ^Mpanic[cpu0]/thread=ffffff005b7e1c80:
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 683410 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff005b7e1120 addr=fffffe0a3e18ec20
Jul 1 15:15:08 ecw-mysql1 unix: [ID 100000 kern.notice]
Jul 1 15:15:08 ecw-mysql1 unix: [ID 839527 kern.notice] sched:
Jul 1 15:15:08 ecw-mysql1 unix: [ID 753105 kern.notice] #pf Page fault
Jul 1 15:15:08 ecw-mysql1 unix: [ID 532287 kern.notice] Bad kernel fault at addr=0xfffffe0a3e18ec20
Jul 1 15:15:08 ecw-mysql1 unix: [ID 243837 kern.notice] pid=0, pc=0xfffffffffb8a0663, sp=0xffffff005b7e1218, eflags=0x10246
Jul 1 15:15:08 ecw-mysql1 unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 2660<vmxe,xmme,fxsr,mce,pae>
Jul 1 15:15:08 ecw-mysql1 unix: [ID 624947 kern.notice] cr2: fffffe0a3e18ec20
Jul 1 15:15:08 ecw-mysql1 unix: [ID 100000 kern.notice]
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] rdi: fffffe0a3e18ec20 rsi: 0 rdx: e0508673
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] rcx: 3 r8: 0 r9: ffffff0cb9384000
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] rax: 0 rbx: e0508673 rbp: ffffff005b7e12b0
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] r10: 0 r11: ffffff0000002000 r12: 0
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] r13: 1 r14: fffffe0a3e18ec20 r15: e0508673
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] fsb: 0 gsb: fffffffffbc5ef70 ds: 4b
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] trp: e err: 3 rip: fffffffffb8a0663
Jul 1 15:15:08 ecw-mysql1 unix: [ID 592667 kern.notice] cs: e030 rfl: 10246 rsp: ffffff005b7e1218
Jul 1 15:15:08 ecw-mysql1 unix: [ID 266532 kern.notice] ss: e02b
Jul 1 15:15:08 ecw-mysql1 unix: [ID 100000 kern.notice]
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1000 unix:die+10f ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1110 unix:trap+1768 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1120 unix:_cmntrap+12f ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e12b0 unix:atomic_cas_ptr+3 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1350 unix:hati_pte_map+160 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e13d0 unix:hati_load_common+15d ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1490 unix:hat_devload+15d ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e14f0 rootnex:rootnex_map_regspec+151 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e15a0 rootnex:rootnex_map+141 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e15f0 genunix:ddi_map+51 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e16e0 npe:npe_bus_map+43d ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1720 pcie_pci:pepb_bus_map+31 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1760 pcie_pci:pepb_bus_map+31 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e17b0 genunix:ddi_map+51 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1870 genunix:ddi_regs_map_setup+d5 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e18c0 genunix:pci_config_setup+69 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1900 pcie:pcie_init_bus+41 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1a30 pcie_pci:pepb_initchild+bc ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1ab0 pcie_pci:pepb_ctlops+276 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1af0 genunix:init_node+78 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1b30 genunix:i_ndi_config_node+fa ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1b60 genunix:i_ndi_init_hw_children+48 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1bc0 genunix:config_immediate_children+83 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1c10 genunix:devi_config_common+a6 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1c60 genunix:mt_config_thread+53 ()
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 655072 kern.notice] ffffff005b7e1c70 unix:thread_start+8 ()
Jul 1 15:15:08 ecw-mysql1 unix: [ID 100000 kern.notice]
Jul 1 15:15:08 ecw-mysql1 genunix: [ID 672855 kern.notice] syncing file systems...
Jul 1 15:15:09 ecw-mysql1 genunix: [ID 904073 kern.notice] done
Jul 1 15:15:10 ecw-mysql1 genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Jul 1 15:17:51 ecw-mysql1 genunix: [ID 409368 kern.notice] ^M100% done: 1588175 pages dumped, compression ratio 3.44,
Jul 1 15:17:51 ecw-mysql1 genunix: [ID 851671 kern.notice] dump succeeded

Anyway, sometimes they blame the xVM hypervisor for the crash, sometimes not. I've got twin Dell 2900s and have moved the domUs from one machine to the other, same results.

Name                            ID   Mem VCPUs      State   Time(s)
Def                                  1024     1                  0.0
Domain-0                         0  34154     8     r-----   3871.7
EDB_Bs                           1   1024     1     -b----    803.8
EDB_Faare                        8   2048     2     -b----   1099.3
EDB_Gral                         7   1024     1     -b----    185.2
EDB_NC                           6   1024     1     -b----    290.0
EDB_Tg                           2   1024     1     -b----     45.0
EDB_Way                          3   1024     1     -b----     62.3
EDB_Wel                          5   1024     1     -b----    278.2
EDB_Wnd                          9   1024     1     -b----    306.9
EHX_Dbase                       10   4096     1     -b----     51.3
Iine                             4   1024     1     -b----     76.0
Repair                                512     1                 13.1

Anyway, the crashes occur when the Time(s) for any one domU gets up around 25000 or so. These are production databases, so they do get a lot of work.

Anyway, it's aggravating when the servers die like that, but zfs is there helping out, so that's nice. No idea if any of this makes sense, it's late, and I'm not too concerned about it anymore, but any help would be great!

thanks,
Jack
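[Note: since the domU drives are created as sparse zvols (the -s flag), their actual space consumption can be checked with ordinary zfs properties. The dataset name below is the one from the post; a minimal check, assuming that layout:]

  pfexec zfs get volsize,refreservation,used datastore/virtMachine1
  # -s leaves refreservation at "none", so "used" only grows as the guest writes data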
On Thu, Jul 2, 2009 at 1:55 PM, Jack <no-reply@opensolaris.org> wrote:
> When I ran all zones, everything was fine and fast, however vendor requires RHEL,
> and I refuse to give up ZFS, so I had to fire up xVM just so I could run MySQL
> inside an x86 container called RHEL 5.2

Is that all you're running on the domU? MySQL? Since MySQL is owned by Sun, they SHOULD support running in a Solaris zone :P

And why 5.2? Why not 5.3? Last time I checked, my 5.3 domU works fine running Alfresco (which includes MySQL). You might want to try upgrading at least the domU kernel.

Also, what version of OpenSolaris are you running?

--
Fajar
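[Note: on a RHEL 5.x paravirtualized guest the Xen-aware kernel normally ships in the kernel-xen package, so upgrading just the domU kernel as suggested here would look roughly like this, run inside the guest; package name assumed from stock RHEL 5 packaging:]

  # inside the RHEL 5.2 domU, with update channels available
  yum update kernel-xen
  reboot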
Jack wrote:
> Hello,
> I've got 9 domUs. They are each a RHEL 5.2 instance. They have 1G ram, 1 cpu,
> 100G drive. They are paravirtualized. The drives used are created as such:
>
>   pfexec zfs create -s -V 100G datastore/virtMachine1
>
> The hardware is a Dell 2900, 48G ram, 3.06T of 15k rpm sas drives.
> OpenSolaris seems to be fairly happy on this system.

This doesn't have anything to do with your problem... But I'm curious, what do you have for CPUs? Are you limiting dom0 memory on boot? What about the number of CPUs dom0 can use? e.g.

  kernel /boot/amd64/xen.gz com1=9600,8n1 console=com1 dom0_mem=2g dom0_max_vcpus=2 dom0_vcpus_pin=true

> When I ran all zones, everything was fine and fast, however vendor requires RHEL,
> and I refuse to give up ZFS, so I had to fire up xVM just so I could run MySQL
> inside an x86 container called RHEL 5.2
>
> Anyway, these domUs boot, run, work pretty well (slower than zones by about 17%
> btw), and generally work fine.
>
> Except that they crash pretty regularly anywhere between 6 and 10 days.

What version of OpenSolaris are you using? Are you using stock Xen bits (that come with OpenSolaris)? The bug looks familiar, but I'll have to do some searching...

MRJ

> [rest of Jack's post, including the panic log and domain list, quoted in full; snipped]
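[Note: the build and hypervisor details MRJ is asking about can be read straight off the dom0; a quick check, assuming a standard OpenSolaris xVM install:]

  cat /etc/release          # reports the snv_ build
  uname -a                  # shows the i86xpv kernel when booted under xVM
  xm info | grep xen_       # hypervisor major/minor version from the xVM tools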
Okay, the version of OpenSolaris is:

  Sun Microsystems Inc.   SunOS 5.11   snv_101b   November 2008

xVM is the 3.1 (whatever is stock with OSOL). I don't have anything unusual. The hardware is as such: Dell 2900, 48G ram, Intel(r) Xeon(r) CPU E5420 @ 2.50GHz, dual 250G sata drives for the rpool, 8 450G sas drives for the datastore pool. It is really fast and does a great job, with the exception of the crashing.

And yes, I agree, I'd just run MySQL on Solaris if I could. It's the vendor that's the moron, not me ;) eClinicalWorks is so backwards they think that RHEL is a better platform for MySQL than OSOL, and they have it in their heads that x86 virtualization is somehow superior to zones. I fought this for about 6 months, but they are threatening to pull our contract, so I have to yield to the RHEL madness. Anyway, that's why I don't mind it crashing... it's their fault, eCW is filled with idiocy. It's a bit slower now with RHEL on top of xVM on top of Solaris - 17% or so, but whatever. </rant>

Okay, so if there is anything I can do to improve the product, I'd be glad to. Consider this a good test platform, as this is real data (9 active clinics) with a real load. It's never working the cpu hard at all, just the drives get hit pretty good. eClinicalWorks uses full joins for about 80% of their db queries...
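[Note: given that the CPUs are mostly idle but "the drives get hit pretty good", per-vdev I/O on the datastore pool is worth watching while the databases are busy. These are standard zpool commands; the pool name is the one from the posts above:]

  zpool status datastore            # layout and health of the 8-disk sas pool
  zpool iostat -v datastore 5       # per-vdev bandwidth and ops every 5 seconds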
> This doesn't have anything to do with your problem...
> But I'm curious, what do you have for CPUs?
> Are you limiting dom0 memory on boot? What about
> the number of CPUs dom0 can use? e.g.
>
>   kernel /boot/amd64/xen.gz com1=9600,8n1 console=com1 dom0_mem=2g dom0_max_vcpus=2 dom0_vcpus_pin=true

Okay, this is an Intel(r) Xeon(r) CPU E5420 @ 2.50GHz based system; there are 8 cores (twin quad-cores).

No, I'm not doing anything to the boot of xVM. I used to do that on my Linux versions of Xen, but got out of the habit when I saw xVM managing it okay on its own. Plus, I didn't read anywhere that they wanted that to happen. Guess I'll give that a shot. But what about ZFS? Doesn't it want just a ton of memory? Seems like I should have a bunch around for that. That's why I asked for 48G ram on these machines ;)

This is my grub entry:

  title Solaris xVM
  findroot (pool_rpool,0,a)
  kernel$ /boot/$ISADIR/xen.gz
  module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS
  module$ /platform/i86pc/$ISADIR/boot_archive

> What version of OpenSolaris are you using? Are you using stock Xen bits (that
> come with opensolaris)? The bug looks familiar, but I'll have to do some searching...

Sun Microsystems Inc. SunOS 5.11 snv_101b November 2008, and yes, stock xVM.

Now, I've tried xVM on another 2900 configured identically to this one with snv_111b - have it running right now, and performance is really bad there. It's so slow it's unreal, so I just quit on that and went back to the one that worked pretty well. Boot times on the newer 0906 version are around 20 minutes for rhel5.2, and centos 5.3 won't even boot after being installed. That's why I quit there - had something that worked, why complain ;)
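[Note: to answer the "doesn't ZFS want a ton of memory" question empirically, the ARC's current size and ceiling can be pulled from kernel statistics before deciding how much memory to leave to dom0; these are standard Solaris kstat names:]

  kstat -p zfs:0:arcstats:size      # current ARC size in bytes
  kstat -p zfs:0:arcstats:c_max     # current ARC upper bound in bytes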
On Sat, Jul 4, 2009 at 2:05 AM, Jack <no-reply@opensolaris.org> wrote:
> Okay, the version of OpenSolaris is:
> Sun Microsystems Inc.   SunOS 5.11   snv_101b   November 2008

That's pretty old :P I suggest you upgrade to the latest bits (117), but try it on dev servers first. One of the things to watch out for is that OpenSolaris > snv_105 uses Crossbow, which might give you problems if you use vlans (as in, you need some config adjustments).

--
Fajar
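[Note: on an OpenSolaris 2008.11-era install, moving to newer bits such as snv_117 generally means pointing pkg(5) at the dev repository and running an image-update, which lands in a new boot environment so it can be rolled back. A sketch, assuming the dev repository URL as it was published at the time; older pkg builds spell the subcommand set-authority rather than set-publisher:]

  pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  pfexec pkg image-update
  beadm list      # the previous boot environment stays selectable at the GRUB menu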
Jack wrote:
>> This doesn't have anything to do with your problem...
>> But I'm curious, what do you have for CPUs?
>> Are you limiting dom0 memory on boot? What about
>> the number of CPUs dom0 can use? e.g.
>>
>>   kernel /boot/amd64/xen.gz com1=9600,8n1 console=com1 dom0_mem=2g dom0_max_vcpus=2 dom0_vcpus_pin=true
>
> Okay, this is an Intel(r) Xeon(r) CPU E5420 @ 2.50GHz based system; there are
> 8 cores (twin quad-cores).
>
> No, I'm not doing anything to the boot of xVM. I used to do that on my Linux versions
> of Xen, but got out of the habit when I saw xVM managing it okay on its own. Plus,
> I didn't read anywhere that they wanted that to happen. Guess I'll give that a shot. But
> what about ZFS? Doesn't it want just a ton of memory? Seems like I should have a
> bunch around for that. That's why I asked for 48G ram on these machines ;)
>
> This is my grub entry:
>
>   title Solaris xVM
>   findroot (pool_rpool,0,a)
>   kernel$ /boot/$ISADIR/xen.gz
>   module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS
>   module$ /platform/i86pc/$ISADIR/boot_archive

You should limit the dom0 cpus to 2-4, depending on your load (mpstat) when doing a lot of IO. Even better if you can isolate dom0 CPUs from the guests. This is easier if you have a really large system, of course :-)

You also should limit the dom0 memory so you don't balloon dom0 memory down. For your setup, I would think 16g should be plenty... But you won't know for sure until you try :-)

  kernel$ /boot/$ISADIR/xen.gz com1=9600,8n1 console=com1 dom0_mem=16g dom0_max_vcpus=4 dom0_vcpus_pin=true

As a safety net, you could restrict dom0 ballooning so you don't accidentally take away its memory.

  svccfg -s xvm/xend setprop config/dom0-min-mem=16000
  svcadm refresh xvm/xend; svcadm restart xvm/xend

Since you're using zfs in dom0, you should limit the size of the arc. I would start with 1/2 the memory in your dom0 if you have >= 4G. e.g.

  echo "set zfs:zfs_arc_max = 0x200000000" >> /etc/system

>> What version of OpenSolaris are you using? Are you using stock Xen bits (that
>> come with opensolaris)? The bug looks familiar, but I'll have to do some searching...
>
> Sun Microsystems Inc. SunOS 5.11 snv_101b November 2008, and yes, stock xVM.
>
> Now, I've tried xVM on another 2900 configured identically to this one with
> snv_111b - have it running right now, and performance is really bad there. It's
> so slow it's unreal, so I just quit on that and went back to the one that worked
> pretty well. Boot times on the newer 0906 version are around 20 minutes for
> rhel5.2, and centos 5.3 won't even boot after being installed. That's why I quit
> there - had something that worked, why complain ;)

The centos5.3 problem is likely
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6836480

But you shouldn't be seeing things go slow like that. Can you try the above menu.lst changes and arc limits to see if it still runs slow? You can copy the xdb driver from b118 and it should fix the centos5.3 problem you're seeing.

What kind of disk is the boot disk (sata)? If sata, I assume it's not running in IDE mode?

Thanks,

MRJ
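[Note: after editing menu.lst and /etc/system and rebooting, the settings suggested above can be sanity-checked with something along these lines, assuming the stock xVM toolstack:]

  xm list Domain-0                          # should show roughly 16384 MB and 4 VCPUs
  svcprop -p config/dom0-min-mem xvm/xend   # the ballooning floor set above
  kstat -p zfs:0:arcstats:c_max             # should report 8589934592 (0x200000000, i.e. 8 GiB)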
"You also should limit the dom0 memory so you don''t balloon dom0 memory down" So, does that cause problems to have the full allocation of ram available and then as you add VMs, it disappear? I''ve not yet implemented your changes, but will do so shortly (life and work, both maxed), and report the results back. Should be interesting! thanks, jdownes -- This message posted from opensolaris.org
Jack wrote:> "You also should limit the dom0 memory so you don''t balloon dom0 memory down" > > So, does that cause problems to have the full allocation of ram available> and then as you add VMs, it disappear? It does not interact well with zfs... MRJ> I''ve not yet implemented your> changes, but will do so shortly (life and work, both maxed), and report > the results back. Should be interesting!> > thanks, > jdownes
Okay, I've done this:

  kernel$ /boot/$ISADIR/xen.gz dom0_mem=16g dom0_max_vcpus=4 dom0_vcpus_pin=true

I removed the com port stuff, assuming that to be the serial console - we don't even have one here, just use ssh and KVMs.

Also, since I'm using 16G for main memory of domain0, I followed your instruction to use half for the arc:

  set zfs:zfs_arc_max = 0x800000000

is in /etc/system... hopefully that means use 8G for the arc ;)
Jack wrote:
> Okay, I've done this:
>
>   kernel$ /boot/$ISADIR/xen.gz dom0_mem=16g dom0_max_vcpus=4 dom0_vcpus_pin=true
>
> I removed the com port stuff, assuming that to be the serial console - we don't
> even have one here, just use ssh and KVMs.
>
> Also, since I'm using 16G for main memory of domain0, I followed your instruction
> to use half for the arc:
>
>   set zfs:zfs_arc_max = 0x800000000
>
> is in /etc/system... hopefully that means use 8G for the arc ;)

Nope :-)

  set zfs:zfs_arc_max = 0x200000000
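[Note: the correction comes down to hex arithmetic; the value placed in /etc/system above is 32 GiB, not 8 GiB:]

  0x800000000 = 34,359,738,368 bytes = 32 GiB
  0x200000000 =  8,589,934,592 bytes =  8 GiB   (8 * 1024^3)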
Hmm, I just checked the arc size with Ben Rockwood's arc_summary script and it shows:

  ARC Size:
          Current Size:             8070 MB (arcsize)
          Target Size (Adaptive):   14292 MB (c)
          Min Size (Hard Limit):    1918 MB (zfs_arc_min)
          Max Size (Hard Limit):    15351 MB (zfs_arc_max)

I made the change to /etc/system just to see if it'd crash by letting it run like that. Guess not; it didn't change anything either, though. Perhaps zfs "knew" what I meant? Anyway, after the change, I have very similar numbers... which is probably just the movement of the arc on its own... never mind the change ;)

  ARC Size:
          Current Size:             8175 MB (arcsize)
          Target Size (Adaptive):   15351 MB (c)
          Min Size (Hard Limit):    1918 MB (zfs_arc_min)
          Max Size (Hard Limit):    15351 MB (zfs_arc_max)
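[Note: a setting in /etc/system only takes effect at the next boot, which would explain why Max Size is still sitting at roughly 15 GB here. On a live dom0 the cap can also be lowered with mdb, though that is an unsupported poke at a running kernel and worth trying on a test box first:]

  # as root on the running system; use with care
  echo "zfs_arc_max/Z 0x200000000" | mdb -kw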