Hi,

I am trying to make sense of the various stack pointers within Xen. My problem is that my newly created domain gets killed, apparently because Xen cannot write to the stack when returning to the domain. I am not sure which stack pointer Xen is using for the faulting write, though. I am looking at two different esp values, the ones printed by the printk below:

    execution_context_t ctxt;
    memcpy(&ctxt, get_execution_context(), sizeof(execution_context_t));

    printk("esp1 %08lx %08lx\n", ctxt.esp, current->thread.esp1);

What is the difference between these two stacks, and what is the recommended way of reading and setting their values from within the unprivileged domain?

thanks,
Jacob

-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
> execution_context_t ctxt;
> memcpy(&ctxt, get_execution_context(),sizeof(execution_context_t));
>
> printk("esp1 %08lx %08lx\n",ctxt.esp, current->thread.esp1);

Just to add a little more info: my problem seems to be that the value of ctxt.esp as read from inside Xen is about 0x4000 bytes lower than both the value I set at domain creation time and the value I get when reading esp from within the domain.

I have tried setting the value of esp1 with the stack_switch hypercall, but apparently this is not the one causing the Xen page fault.

best regards,
Jacob
> On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
>
> > execution_context_t ctxt;
> > memcpy(&ctxt, get_execution_context(),sizeof(execution_context_t));
> >
> > printk("esp1 %08lx %08lx\n",ctxt.esp, current->thread.esp1);

So current->thread.esp1 is the stack pointer for ring 1 for the current domain (aka the 'kernel' stack pointer for xenolinux). The equivalent 'stored' version of this is the 'ring1_esp' field in a full_execution_context_t.

ctxt is an 'execution_context_t', which holds the user-level (ring 3) stack pointer (and other registers).

> I have tried setting the value of esp1 with the stack_switch hypercall,
> but apparently this is not the one causing the Xen page fault.

The stack_switch hypercall is typically used by a GuestOS to tell Xen what its "kernel" (ring 1) stack (segment + sp) is. It's a bit like a 'virtual TSS' for each domain (since the ss/sp for ring 1 are updated from this on each domain schedule).

For its use in XenoLinux see
    ./xenolinux-2.4.24-sparse/arch/xeno/kernel/process.c:__switch_to()

The initial ring 1 ss/sp come from the full_execution_context_t in the builddomain_t (see ./xen/common/domain.c:final_setup_guestos()).

Anyway, can you post -

a) what it is you're trying to do in detail (I'm guessing it's to do with migration but not sure what stage you're at) and

b) the console output leading up to your crash (xen and XL output if possible/relevant).

You might also like to look at ./tools/xc/lib/xc_linux_{save,restore}.c to see how it works/ed in our version.

cheers,

S.
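To make the 'virtual TSS' description above a bit more concrete, here is a toy model in plain C. It is only a sketch: every name in it (toy_domain, toy_tss, toy_stack_switch, toy_schedule) is made up for illustration and none of it is Xen's actual code or layout. It just shows the idea that the guest registers a ring 1 ss/esp pair, that the pair is copied into the TSS-like slot when the domain is scheduled, and that the ring 3 stack pointer is a separate value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: names and layout are hypothetical, not Xen's structures. */
struct toy_domain {
    uint32_t ring1_ss;   /* "kernel" stack segment registered by the guest */
    uint32_t ring1_esp;  /* "kernel" stack pointer registered by the guest */
    uint32_t ring3_esp;  /* user stack pointer saved with the other registers */
};

/* Stand-in for the per-CPU slot the hardware reads on a ring transition. */
struct toy_tss {
    uint32_t ss1;
    uint32_t esp1;
};

/* Models the effect of the stack_switch hypercall: remember the new
 * ring 1 stack for this domain. The ring 3 value is not touched. */
static void toy_stack_switch(struct toy_domain *d, uint32_t ss, uint32_t esp)
{
    assert(d != NULL);
    d->ring1_ss = ss;
    d->ring1_esp = esp;
}

/* Models a domain schedule: load the registered ring 1 stack into the
 * TSS-like slot so later transitions use this domain's kernel stack. */
static void toy_schedule(struct toy_tss *tss, const struct toy_domain *next)
{
    assert(tss != NULL && next != NULL);
    tss->ss1 = next->ring1_ss;
    tss->esp1 = next->ring1_esp;
}
```

In this model, changing the ring 1 pair never alters ring3_esp, which is why setting esp1 alone would not be expected to change the esp value read from the saved register context.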
On Tue, 2004-01-20 at 13:25, Steven Hand wrote:
> > On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
>
> Anyway, can you post -
>
> a) what it is you're trying to do in detail (I'm guessing it's
> to do with migration but not sure what stage you're at) and

I have migrated the domain pages to a new domain, and am trying to get it to resume after migration. I am currently crashing sometime shortly after resumption, and I have instrumented Xen to dump some info about the state of the domain.

Since I am reading all info from inside the old domain 'by hand', this is probably a case of some CPU state I have not managed to get across. I am quite sure the crash is due to a wrong stack page being pointed to, but since my recovery code is running in ring 1 in a __cli() context, I suppose the ring 3 stack cannot be to blame.

best,
Jacob
On Tue, 2004-01-20 at 15:42, Jacob Gorm Hansen wrote:
> On Tue, 2004-01-20 at 13:25, Steven Hand wrote:
> > > On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
> >
> > Anyway, can you post -
> >
> > a) what it is you're trying to do in detail (I'm guessing it's
> > to do with migration but not sure what stage you're at) and

This is my recovery function, which eip points to when the new domain is started:

    static void recover(void)
    {
        __cli();
        HYPERVISOR_stack_switch(__KERNEL_DS, current->thread.esp0);
        asm volatile("addl $0x0, -4(%%eax)" : :"eax"(current->thread.esp0));
        while(1) HYPERVISOR_console_write("alive",5);

    ...

I touch the ring 1 stack to make sure it is writeable (due to my migration hacks it may not always be).

The output on the serial looks like below. The first four lines are printed by my version of Xen as a response to SCHEDOP_exit:

    exit dom 41 : esp1 c3a96000, ss 00000821
    eip c00b5b6d esp c3a95ed8 eflags 296
    Killing domain 41
    Releasing task 41
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    DOM42: alive
    fault_in_hypervisor 2
    dom 42 : esp1 c3a96000, ss 00000821
    eip 00000000 esp c3a92004 eip 00000000 pf-addr c3a91ff8 eflags 10286
    Killing domain 42
    Releasing task 42

As you can see, the crash is not happening in direct response to some action in the domain, but rather as an effect of something happening outside. I was speculating that perhaps I need to re-register for the timer interrupt, or that the __cli() does not prevent Xen from trying to deliver them?

Btw, the value of 'current' checks out, and is equal to the value before migration.

best,
Jacob
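For anyone reading the dump above, the addresses can be related with a little arithmetic. The sketch below assumes 4 KiB x86 pages and uses made-up helper names (page_base, bytes_below); it does no diagnosis, it only computes the distances between the printed values.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on 32-bit x86 without large pages. */
#define DEMO_PAGE_SIZE 0x1000u

/* Round an address down to the start of the page that contains it. */
static uint32_t page_base(uint32_t addr)
{
    return addr & ~(DEMO_PAGE_SIZE - 1u);
}

/* Distance in bytes from a lower address up to a higher one. */
static uint32_t bytes_below(uint32_t upper, uint32_t lower)
{
    assert(upper >= lower);
    return upper - lower;
}
```

With the values from the dom 42 dump, bytes_below(0xc3a92004, 0xc3a91ff8) is 0xc, i.e. three 32-bit words below the printed esp. The printed esp lies in the page starting at 0xc3a92000, while the faulting address lies in the page below it, starting at 0xc3a91000. The printed esp is also 0x3ffc bytes below the printed esp1 (0xc3a96000).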
> On Tue, 2004-01-20 at 15:42, Jacob Gorm Hansen wrote:
> > On Tue, 2004-01-20 at 13:25, Steven Hand wrote:
> > > > On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
> > >
> > > Anyway, can you post -
> > >
> > > a) what it is you're trying to do in detail (I'm guessing it's
> > > to do with migration but not sure what stage you're at) and
>
> This is my recovery function, which eip points to when the new domain is
> started:
>
> static void recover(void)
> {
>     __cli();
>     HYPERVISOR_stack_switch(__KERNEL_DS, current->thread.esp0);
>     asm volatile("addl $0x0, -4(%%eax)" : :"eax"(current->thread.esp0));
>     while(1) HYPERVISOR_console_write("alive",5);
>
> ...
>
> I touch the ring1 stack to make sure it is writeable (due to my
> migration hacks it may not always be).
>
> The output on the serial looks like below. The first four lines are
> printed by my version of Xen as a response to SCHEDOP_exit:
>
> exit dom 41 : esp1 c3a96000, ss 00000821
> eip c00b5b6d esp c3a95ed8 eflags 296
> Killing domain 41
> Releasing task 41
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> DOM42: alive
> fault_in_hypervisor 2
> dom 42 : esp1 c3a96000, ss 00000821
> eip 00000000 esp c3a92004 eip 00000000 pf-addr c3a91ff8 eflags 10286
> Killing domain 42
> Releasing task 42
>
> As you can see, the crash is not happening in direct response to some
> action in the domain, but rather as an effect of something happening
> outside. I was speculating that perhaps I need to re-register for the
> timer interrupt, or that the __cli() does not prevent Xen from trying to
> deliver them?

The __cli() should certainly prevent any events from being delivered.

It's tricky to work out what the above means as you've clearly hacked Xen to e.g. print "fault_in_hypervisor 2" (is 2 the error_code? is there any reason you've added this stuff in place of the regular code in do_trap() or do_page_fault()?). Can you post the code/diffs for these parts of xen?

What is at 0xc3a91ff8? Have you actually taken a page fault? I note that zero eips are not so good -- but OTOH I don't know what those values you print actually are [partic. given there are two eips]... plus if we're really multiply faulting in the hypervisor, all bets may well be off...

cheers,

S.
On Tue, 2004-01-20 at 18:08, Steven Hand wrote:
> The __cli() should certainly prevent any events from being delivered.
>
> It's tricky to work out what the above means as you've clearly hacked
> Xen to e.g print "fault_in_hypervisor 2" (is 2 the error_code? is
> there any reason you've added this stuff in place of the regular code
> in do_trap() or do_page_fault()?). Can you post the code/diffs for
> these parts of xen?

I added this because otherwise Xen will just kill the domain silently, unless there is a debugging option I have overlooked.

about line 329 in traps.c:

 fault_in_hypervisor:
+    printk("fault_in_hypervisor 2\n");
+    printk("dom %d : esp1 %08lx, ss %08lx\n",current->domain, current->thread.esp1, current->thread.ss1);
+
+    execution_context_t ctxt;
+    memcpy(&ctxt,
+           get_execution_context(),
+           sizeof(execution_context_t));
+
+    printk("eip %p esp %p eip %p addr %p flags %x\n", ctxt.eip, ctxt.esp, gtb->eip, addr, ctxt.eflags);

> What is at 0xc3a91ff8? Have you actually taken a page fault? I note
> that zero eips are not so good -- but OTOH I don't know what those
> values you print actually are [partic given there are two eips]...
> plus if we're really multiply faulting in the hypervisor, all bets
> may well be off...

0xc3a91ff8 appears to be where ctxt.esp would be pointing after pushing a few values. The dual 0 eips worry me as well. I have not registered any event callbacks for the domain; perhaps that is the reason? But should I not be allowed to run without registering for interrupts in a brand-new domain? To me this looks like an interrupt delivery gone bad.

cheers,
Jacob
> about line 329 in traps.c:
>
>  fault_in_hypervisor:
> +    printk("fault_in_hypervisor 2\n");
> +    printk("dom %d : esp1 %08lx, ss %08lx\n",current->domain, current->thread.esp1, current->thread.ss1);
> +
> +    execution_context_t ctxt;
> +    memcpy(&ctxt,
> +           get_execution_context(),
> +           sizeof(execution_context_t));
> +
> +    printk("eip %p esp %p eip %p addr %p flags %x\n", ctxt.eip, ctxt.esp, gtb->eip, addr, ctxt.eflags);

The EIP/ESP values you are printing aren't up to date. You should be printing regs->eip and regs->esp. Also something like:

    struct pt_regs *guest_regs = (struct pt_regs *)(current->thread.esp1-1);
    <print guest_regs->esp, guest_regs->eip>

> 0xc3a91ff8 appears to be where ctxt.esp is pointing if you push some
> values. The dual 0 eips worry me as well, I have not registered any
> event-callbacks for the domain, perhaps that is the reason? But should I
> not be allowed to run without registering for interrupts in a brand-new
> domain? To me this looks like an interrupt-delivery gone bad.

Maybe an exception of some kind? Did you fill in the trap_table (virtual IDT) in full_execution_context?

 -- Keir
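One C detail worth keeping in mind when trying a cast like the guest_regs one above: where the "- 1" sits relative to the cast decides how far the pointer moves. Subtracting from an integer (or byte pointer) before casting steps back one byte, while subtracting after casting steps back one whole structure. Which of the two was intended in the snippet is not stated in the thread, so the sketch below only demonstrates the difference. It uses a made-up demo_regs type and a small buffer of its own; it is not Xen's struct pt_regs and does not touch any real stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a saved register frame; NOT Xen's
 * struct pt_regs, just three 32-bit fields for illustration. */
struct demo_regs {
    uint32_t eip;
    uint32_t esp;
    uint32_t eflags;
};

/* A small word-aligned buffer standing in for a descending stack. */
static uint32_t demo_stack[16];

/* One past the highest word, i.e. where an empty stack would start. */
static void *demo_stack_top(void)
{
    return demo_stack + 16;
}

/* Cast first, then subtract: steps back by one whole struct demo_regs. */
static struct demo_regs *frame_below(void *top)
{
    assert(top != NULL);
    return (struct demo_regs *)top - 1;
}

/* Subtract on a byte-sized type: steps back by exactly one byte. */
static unsigned char *byte_below(void *top)
{
    assert(top != NULL);
    return (unsigned char *)top - 1;
}
```

Here frame_below(top) lands sizeof(struct demo_regs) bytes under the top, whereas byte_below(top) lands a single byte under it, which would not be a usable frame address.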
On Tue, 2004-01-20 at 18:08, Steven Hand wrote:
> > On Tue, 2004-01-20 at 15:42, Jacob Gorm Hansen wrote:
> > > On Tue, 2004-01-20 at 13:25, Steven Hand wrote:
> > > > > On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
> > > >
> > > > Anyway, can you post -
> > > >
> > > > a) what it is you're trying to do in detail (I'm guessing it's
> > > > to do with migration but not sure what stage you're at) and

Adding the following to my recover() code changed things a bit; there is probably more Xen state I need to restore before I can live happily ever after:

    HYPERVISOR_set_trap_table(trap_table);
    HYPERVISOR_set_fast_trap(SYSCALL_VECTOR);

Digging in. While I understand that this may mean I now have various nice things such as a page fault handler, I am still puzzled how my infinite loop could crash like that.

Jacob
> On Tue, 2004-01-20 at 18:08, Steven Hand wrote:
> > > On Tue, 2004-01-20 at 15:42, Jacob Gorm Hansen wrote:
> > > > On Tue, 2004-01-20 at 13:25, Steven Hand wrote:
> > > > > > On Tue, 2004-01-20 at 12:11, Jacob Gorm Hansen wrote:
> > > > >
> > > > > Anyway, can you post -
> > > > >
> > > > > a) what it is you're trying to do in detail (I'm guessing it's
> > > > > to do with migration but not sure what stage you're at) and
>
> Adding the following to my recover() code changed things a bit, probably
> more Xen state I need to restore before I can live happily ever after:
>
>     HYPERVISOR_set_trap_table(trap_table);
>     HYPERVISOR_set_fast_trap(SYSCALL_VECTOR);

The alternative is to copy this info between the full_execution_context's of the old and new domains.

> Digging in. While I understand that this may mean I now have various
> nice things such as a page fault handler, I am still puzzled how my
> infinite loop could crash like that.

Looks weird. Why not instrument Xenolinux's trap handlers to see which exception you are occasionally taking. It's not hard -- most go thru do_trap() in arch/xeno/kernel/traps.c. GPFs and page faults go thru separate specialised functions.

 -- keir
On Tue, 2004-01-20 at 19:24, Keir Fraser wrote:
> > HYPERVISOR_set_trap_table(trap_table);
> > HYPERVISOR_set_fast_trap(SYSCALL_VECTOR);
>
> The alternative is to copy this info between the
> full_execution_context's of the old and new domains.

Except that with my setup I do not have access to these, unless there is a way to map them from within each unprivileged domain.

> Looks weird. Why not instrument Xenolinux's trap handlers to see which
> exception you are occasionally taking. It's not hard -- most go thru
> do_trap() in arch/xeno/kernel/traps.c. GPFs and page faults go thru
> separate specialised functions.

It seems that a little more wiggling of things (especially installing the trap vectors _before_ touching the ring 1 stack of current) makes me run quite a bit further, though not really far.

thanks,
Jacob
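The reordering described above can be sketched as follows. This is not the real recover() and it makes no hypercalls: the stub_* functions are hypothetical placeholders that only record the order in which they run, so the ordering can be checked on any machine. The one point taken from the thread is that the trap table is installed before the ring 1 stack is touched; where the stack switch sits relative to the other steps is an assumption of this sketch.

```c
#include <assert.h>

/* Steps of a hypothetical recovery path, named for illustration only. */
enum step {
    STEP_CLI,
    STEP_TRAP_TABLE,
    STEP_FAST_TRAP,
    STEP_STACK_SWITCH,
    STEP_TOUCH_STACK
};

#define MAX_STEPS 8
static enum step step_log[MAX_STEPS];
static int step_count;

/* Append one step to the log, ignoring overflow. */
static void record(enum step s)
{
    if (step_count < MAX_STEPS)
        step_log[step_count++] = s;
}

/* Stubs standing in for the real operations; they only log. */
static void stub_cli(void)               { record(STEP_CLI); }
static void stub_set_trap_table(void)    { record(STEP_TRAP_TABLE); }
static void stub_set_fast_trap(void)     { record(STEP_FAST_TRAP); }
static void stub_stack_switch(void)      { record(STEP_STACK_SWITCH); }
static void stub_touch_ring1_stack(void) { record(STEP_TOUCH_STACK); }

/* Recovery order with trap handlers installed before the first stack write. */
static void recover_sketch(void)
{
    step_count = 0;
    stub_cli();
    stub_set_trap_table();     /* handlers first, so a fault can be reported */
    stub_set_fast_trap();
    stub_stack_switch();       /* placement here is an assumption */
    stub_touch_ring1_stack();  /* may fault; a handler is now registered */
}

/* Position of a step in the log, or -1 if it never ran. */
static int step_index(enum step s)
{
    for (int i = 0; i < step_count; i++)
        if (step_log[i] == s)
            return i;
    return -1;
}
```

The point of the ordering is simply that if touching the stack faults, there is already a registered handler for the fault to be delivered to.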
On Tue, 2004-01-20 at 19:19, Keir Fraser wrote:
> > about line 329 in traps.c:
>
> Maybe an exception of some kind? Did you fill in the trap_table
> (virtual IDT) in full_execution_context?

Hi,

I cleaned up my page table remapping code, and I am pretty confident it is correct now (and the errors from Xen are fixed, thanks Steven). However, I am still having the problem of an exception occurring apparently right as the domain starts.

I have tried copying the trap_table across and installing it while creating the domain, but this has no effect -- for instance, I see lots of page faults (in ret_from_sys_call (xenolinux)) if I print them from Xen, but the xenolinux pf handler is never reached.

Do I have to do anything more than just copy the trap_table into the full-exe-ctxt before domain creation? Are the handler addresses in virtual coordinates?

I have instrumented the GPF and general trap handlers in Xen, but they are not called. Perhaps I should just add a hypercall to make Xen dump the exe-context in user space?

Jacob