David Vrabel
2013-Jun-24 17:42 UTC
[PATCHv6 0/10] kexec: extend kexec hypercall for use with pv-ops kernels
This series (for Xen 4.4) improves the kexec hypercall by making Xen responsible for loading and relocating the image. This allows kexec to be used by pv-ops kernels and should allow kexec to be used from an HVM or PVH privileged domain.

The first patch is a simple clean-up.

The second patch allows hypercall structures to be ABI compatible between 32- and 64-bit guests (by reusing the machinery already present for domctls and sysctls). This seems better than having to keep adding compat handling for new hypercalls.

Patch 3 introduces the new ABI.

Patches 4 and 5 nearly completely reimplement the kexec load, unload and exec sub-ops. The old load_v1 sub-op is then implemented on top of the new code.

Patch 6 calls the kexec image when dom0 crashes. This avoids having to alter dom0 kernels to make an exec sub-op call on crash -- a SHUTDOWN_crash by dom0 will trigger the kexec.

Patches 7 and 8 add the libxc API for the kexec calls. These have already been acked by Ian Campbell.

Patch 9 adds a link-time check on the size of the relocation code.

Patch 10 adds myself as the maintainer for kexec in Xen.

The required kexec-tools patch series has been posted previously; it has been rebased onto the latest kexec-tools and is available from the xen-v3 branch of:

http://xenbits.xen.org/gitweb/?p=people/dvrabel/kexec-tools.git;a=summary

Changes since v5:

- Fix double free in KEXEC_load_v1 failure path.
- Only copy the relocation code and not the whole page.
- Add myself as the kexec maintainer.

Changes since v4 (v5 was not posted to the list):

- _rsvd -> _pad in one of the public ABI structures.
- Fix bug where trailing pages were not zeroed. This fixes loading a 64-bit Linux kernel with a more recent version of kexec-tools.
- Check at link time that the relocation code fits into a page.

Changes since v3:

- Use paddr_t and page_to_maddr() etc. for portability.
- Add explicit padding to hypercall structures where required.
- Minor cleanup of the kexec_reloc assembly.
- Print a message before exec'ing a crash image.
- Style fixes (tabs, trailing whitespace) and typos.
- Fix a bug where using the V1 interface and unloading an image could crash.

Changes since v2:

- Provide the old struct xen_kexec_load if __XEN_INTERFACE_VERSION__ < 4.3.
- Adjust the new struct xen_kexec_load to avoid unnecessary padding.
- Use domheap pages for the image and control pages.
- Remove the DBG() macros from the reloc code.

David
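To make the new flow concrete, a rough sketch of how a toolstack might drive the crash-kernel case with the new interface follows. It uses only the existing get_range sub-op and the names introduced in patch 3; do_kexec_op() is a hypothetical stand-in for however the caller issues the kexec_op hypercall.

    /* 1. Where is the crash region? */
    xen_kexec_range_t range = { .range = KEXEC_RANGE_MA_CRASH };
    do_kexec_op(KEXEC_CMD_kexec_get_range, &range);

    /* 2. Load a KEXEC_TYPE_CRASH image whose segments all lie inside
     *    [range.start, range.start + range.size) and whose entry point
     *    is within the image (see patch 3 for xen_kexec_load_t; a
     *    filled-in example follows that patch). */

    /* 3. No explicit exec sub-op is needed for this case: a
     *    SHUTDOWN_crash from dom0 (patch 6) or a crash of Xen itself
     *    starts the loaded image. */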
David Vrabel
2013-Jun-24 17:42 UTC
[PATCH 01/10] x86: give FIX_EFI_MPF its own fixmap entry
From: David Vrabel <david.vrabel@citrix.com> FIX_EFI_MPF was the same as FIX_KEXEC_BASE_0 which is going away. So add its own entry. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- xen/arch/x86/mpparse.c | 2 -- xen/include/asm-x86/fixmap.h | 1 + 2 files changed, 1 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c index 97d34bc..3753704 100644 --- a/xen/arch/x86/mpparse.c +++ b/xen/arch/x86/mpparse.c @@ -538,8 +538,6 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type) } } -#define FIX_EFI_MPF FIX_KEXEC_BASE_0 - static __init void efi_unmap_mpf(void) { if (efi_enabled) diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h index d026d78..2eefcf4 100644 --- a/xen/include/asm-x86/fixmap.h +++ b/xen/include/asm-x86/fixmap.h @@ -71,6 +71,7 @@ enum fixed_addresses { FIX_APEI_RANGE_BASE, FIX_APEI_RANGE_END = FIX_APEI_RANGE_BASE + FIX_APEI_RANGE_MAX -1, FIX_IGD_MMIO, + FIX_EFI_MPF, __end_of_fixed_addresses }; -- 1.7.2.5
David Vrabel
2013-Jun-24 17:42 UTC
[PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
From: David Vrabel <david.vrabel@citrix.com> GUEST_HANDLE_64() and uint64_aligned_t allow hypercall ABI structures to be identical (binary compatible) for 32 and 64-bit guests. They are currently limited to only being available for use in sysctls and domctls. Relax this restriction so they may be used by any new structures. There is a minimal cost for 32-bit guests on 64-but hypervisors as set_xen_guest_handle() needs to zero the whole field on GUEST_HANDLE_64() handles, but this is expected to be less than the overhead of having to translate compat structures. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- xen/include/public/arch-x86/xen-x86_32.h | 4 +--- xen/include/public/xen.h | 13 ++++++++----- 2 files changed, 9 insertions(+), 8 deletions(-) diff --git a/xen/include/public/arch-x86/xen-x86_32.h b/xen/include/public/arch-x86/xen-x86_32.h index 1504191..f6b4f49 100644 --- a/xen/include/public/arch-x86/xen-x86_32.h +++ b/xen/include/public/arch-x86/xen-x86_32.h @@ -91,8 +91,7 @@ #define machine_to_phys_mapping ((unsigned long *)MACH2PHYS_VIRT_START) #endif -/* 32-/64-bit invariability for control interfaces (domctl/sysctl). */ -#if defined(__XEN__) || defined(__XEN_TOOLS__) +/* 32-/64-bit invariability. */ #undef ___DEFINE_XEN_GUEST_HANDLE #define ___DEFINE_XEN_GUEST_HANDLE(name, type) \ typedef struct { type *p; } \ @@ -107,7 +106,6 @@ #define uint64_aligned_t uint64_t __attribute__((aligned(8))) #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name) -#endif #ifndef __ASSEMBLY__ diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h index 3cab74f..9b574c9 100644 --- a/xen/include/public/xen.h +++ b/xen/include/public/xen.h @@ -858,9 +858,14 @@ __DEFINE_XEN_GUEST_HANDLE(uint64, uint64_t); #endif /* !__ASSEMBLY__ */ -/* Default definitions for macros used by domctl/sysctl. */ -#if defined(__XEN__) || defined(__XEN_TOOLS__) - +/* + * Default definitions for 32/64-bit invariant macros. + * + * Use these in ABI structures that should be identical for 32 and + * 64-bit guests. There is some (very small) overhead in using + * XEN_GUEST_HANDLE_64() instead of XEN_GUEST_HANDLE() so avoid for + * very hot paths. + */ #ifndef uint64_aligned_t #define uint64_aligned_t uint64_t #endif @@ -875,8 +880,6 @@ struct xenctl_bitmap { }; #endif -#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */ - #endif /* __XEN_PUBLIC_XEN_H__ */ /* -- 1.7.2.5
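As an illustration of what this change buys future hypercalls, a structure built with these macros has the same size and layout for 32- and 64-bit guests, so the hypervisor needs no compat translation for it. The structure and field names below are made up for the example:

    struct xen_widget_op {
        uint32_t id;
        uint32_t _pad;                       /* explicit padding: layout identical on 32/64-bit */
        uint64_aligned_t size;               /* 8-byte aligned even for 32-bit x86 guests */
        XEN_GUEST_HANDLE_64(const_void) buf; /* always 8 bytes wide */
    };

A 32-bit guest fills the handle with set_xen_guest_handle(), which zeroes the whole 8-byte field -- the small cost mentioned above.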
David Vrabel
2013-Jun-24 17:42 UTC
[PATCH 03/10] kexec: add public interface for improved load/unload sub-ops
From: David Vrabel <david.vrabel@citrix.com> Add replacement KEXEC_CMD_load and KEXEC_CMD_unload sub-ops to the kexec hypercall. These new sub-ops allow a priviledged guest to provide the image data to be loaded into Xen memory or the crash region instead of guests loading the image data themselves and providing the relocation code and metadata. The old interface is provided to guests requesting an interface version prior to 4.3. Signed-off: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- xen/common/kexec.c | 12 ++++---- xen/include/public/kexec.h | 66 +++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 68 insertions(+), 10 deletions(-) diff --git a/xen/common/kexec.c b/xen/common/kexec.c index 1ba8556..2d34524 100644 --- a/xen/common/kexec.c +++ b/xen/common/kexec.c @@ -734,7 +734,7 @@ static void crash_save_vmcoreinfo(void) #endif } -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load) +static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load) { xen_kexec_image_t *image; int base, bit, pos; @@ -781,7 +781,7 @@ static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load) static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg) { - xen_kexec_load_t load; + xen_kexec_load_v1_t load; if ( unlikely(copy_from_guest(&load, uarg, 1)) ) return -EFAULT; @@ -793,8 +793,8 @@ static int kexec_load_unload_compat(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg) { #ifdef CONFIG_COMPAT - compat_kexec_load_t compat_load; - xen_kexec_load_t load; + compat_kexec_load_v1_t compat_load; + xen_kexec_load_v1_t load; if ( unlikely(copy_from_guest(&compat_load, uarg, 1)) ) return -EFAULT; @@ -866,8 +866,8 @@ static int do_kexec_op_internal(unsigned long op, else ret = kexec_get_range(uarg); break; - case KEXEC_CMD_kexec_load: - case KEXEC_CMD_kexec_unload: + case KEXEC_CMD_kexec_load_v1: + case KEXEC_CMD_kexec_unload_v1: spin_lock_irqsave(&kexec_lock, flags); if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags)) { diff --git a/xen/include/public/kexec.h b/xen/include/public/kexec.h index 36409ff..5dddbec 100644 --- a/xen/include/public/kexec.h +++ b/xen/include/public/kexec.h @@ -116,12 +116,12 @@ typedef struct xen_kexec_exec { * type == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in] * image == relocation information for kexec (ignored for unload) [in] */ -#define KEXEC_CMD_kexec_load 1 -#define KEXEC_CMD_kexec_unload 2 -typedef struct xen_kexec_load { +#define KEXEC_CMD_kexec_load_v1 1 /* obsolete since 0x00040300 */ +#define KEXEC_CMD_kexec_unload_v1 2 /* obsolete since 0x00040300 */ +typedef struct xen_kexec_load_v1 { int type; xen_kexec_image_t image; -} xen_kexec_load_t; +} xen_kexec_load_v1_t; #define KEXEC_RANGE_MA_CRASH 0 /* machine address and size of crash area */ #define KEXEC_RANGE_MA_XEN 1 /* machine address and size of Xen itself */ @@ -152,6 +152,64 @@ typedef struct xen_kexec_range { unsigned long start; } xen_kexec_range_t; +#if __XEN_INTERFACE_VERSION__ >= 0x00040300 +/* + * A contiguous chunk of a kexec image and it''s destination machine + * address. + */ +typedef struct xen_kexec_segment { + XEN_GUEST_HANDLE_64(const_void) buf; + uint64_t buf_size; + uint64_t dest_maddr; + uint64_t dest_size; +} xen_kexec_segment_t; +DEFINE_XEN_GUEST_HANDLE(xen_kexec_segment_t); + +/* + * Load a kexec image into memory. + * + * For KEXEC_TYPE_DEFAULT images, the segments may be anywhere in RAM. 
+ * The image is relocated prior to being executed. + * + * For KEXEC_TYPE_CRASH images, each segment of the image must reside + * in the memory region reserved for kexec (KEXEC_RANGE_MA_CRASH) and + * the entry point must be within the image. The caller is responsible + * for ensuring that multiple images do not overlap. + */ + +#define KEXEC_CMD_kexec_load 4 +typedef struct xen_kexec_load { + uint8_t type; /* One of KEXEC_TYPE_* */ + uint8_t _pad; + uint16_t arch; /* ELF machine type (EM_*). */ + uint32_t nr_segments; + XEN_GUEST_HANDLE_64(xen_kexec_segment_t) segments; + uint64_t entry_maddr; /* image entry point machine address. */ +} xen_kexec_load_t; +DEFINE_XEN_GUEST_HANDLE(xen_kexec_load_t); + +/* + * Unload a kexec image. + * + * Type must be one of KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH. + */ +#define KEXEC_CMD_kexec_unload 5 +typedef struct xen_kexec_unload { + uint8_t type; +} xen_kexec_unload_t; +DEFINE_XEN_GUEST_HANDLE(xen_kexec_unload_t); + +#else /* __XEN_INTERFACE_VERSION__ < 0x00040300 */ + +#define KEXEC_CMD_kexec_load KEXEC_CMD_kexec_load_v1 +#define KEXEC_CMD_kexec_unload KEXEC_CMD_kexec_unload_v1 +typedef struct xen_kexec_load { + int type; + xen_kexec_image_t image; +} xen_kexec_load_t; + +#endif + #endif /* _XEN_PUBLIC_KEXEC_H */ /* -- 1.7.2.5
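For illustration, here is a rough sketch of how a caller might describe a two-segment KEXEC_TYPE_DEFAULT image with the new structures; the buffers, sizes and destination addresses are made up and error handling is omitted:

    xen_kexec_segment_t segs[2];
    xen_kexec_load_t load = { 0 };

    set_xen_guest_handle(segs[0].buf, kernel_buf);  /* hypothetical source buffer */
    segs[0].buf_size   = kernel_size;
    segs[0].dest_maddr = 0x1000000;                 /* page aligned */
    segs[0].dest_size  = 0x800000;                  /* >= buf_size; the tail is zeroed */

    set_xen_guest_handle(segs[1].buf, initrd_buf);
    segs[1].buf_size   = initrd_size;
    segs[1].dest_maddr = 0x4000000;
    segs[1].dest_size  = 0x400000;

    load.type        = KEXEC_TYPE_DEFAULT;
    load.arch        = EM_X86_64;                   /* or EM_386 for a 32-bit image */
    load.nr_segments = 2;
    set_xen_guest_handle(load.segments, segs);
    load.entry_maddr = 0x1000000;

    /* Passed as KEXEC_CMD_kexec_load (sub-op 4) to the kexec_op hypercall. */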
David Vrabel
2013-Jun-24 17:42 UTC
[PATCH 04/10] kexec: add infrastructure for handling kexec images
From: David Vrabel <david.vrabel@citrix.com> Add the code needed to handle and load kexec images into Xen memory or into the crash region. This is needed for the new KEXEC_CMD_load and KEXEC_CMD_unload hypercall sub-ops. Much of this code is derived from the Linux kernel. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- xen/common/Makefile | 1 + xen/common/kimage.c | 823 ++++++++++++++++++++++++++++++++++++++++++++++ xen/include/xen/kimage.h | 59 ++++ 3 files changed, 883 insertions(+), 0 deletions(-) create mode 100644 xen/common/kimage.c create mode 100644 xen/include/xen/kimage.h diff --git a/xen/common/Makefile b/xen/common/Makefile index 0dc2050..c821bb8 100644 --- a/xen/common/Makefile +++ b/xen/common/Makefile @@ -11,6 +11,7 @@ obj-y += irq.o obj-y += kernel.o obj-y += keyhandler.o obj-$(HAS_KEXEC) += kexec.o +obj-$(HAS_KEXEC) += kimage.o obj-y += lib.o obj-y += memory.o obj-y += multicall.o diff --git a/xen/common/kimage.c b/xen/common/kimage.c new file mode 100644 index 0000000..995ce36 --- /dev/null +++ b/xen/common/kimage.c @@ -0,0 +1,823 @@ +/* + * Kexec Image + * + * Copyright (C) 2013 Citrix Systems R&D Ltd. + * + * Derived from kernel/kexec.c from Linux: + * + * Copyright (C) 2002-2004 Eric Biederman <ebiederm@xmission.com> + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. + */ + +#include <xen/config.h> +#include <xen/types.h> +#include <xen/init.h> +#include <xen/kernel.h> +#include <xen/errno.h> +#include <xen/spinlock.h> +#include <xen/guest_access.h> +#include <xen/mm.h> +#include <xen/kexec.h> +#include <xen/kimage.h> + +#include <asm/page.h> + +/* + * When kexec transitions to the new kernel there is a one-to-one + * mapping between physical and virtual addresses. On processors + * where you can disable the MMU this is trivial, and easy. For + * others it is still a simple predictable page table to setup. + * + * The code for the transition from the current kernel to the + * the new kernel is placed in the control_code_buffer, whose size + * is given by KEXEC_CONTROL_PAGE_SIZE. In the best case only a single + * page of memory is necessary, but some architectures require more. + * Because this memory must be identity mapped in the transition from + * virtual to physical addresses it must live in the range + * 0 - TASK_SIZE, as only the user space mappings are arbitrarily + * modifiable. + * + * The assembly stub in the control code buffer is passed a linked list + * of descriptor pages detailing the source pages of the new kernel, + * and the destination addresses of those source pages. As this data + * structure is not used in the context of the current OS, it must + * be self-contained. + * + * The code has been made to work with highmem pages and will use a + * destination page in its final resting place (if it happens + * to allocate it). The end product of this is that most of the + * physical address space, and most of RAM can be used. + * + * Future directions include: + * - allocating a page table with the control code buffer identity + * mapped, to simplify machine_kexec and make kexec_on_panic more + * reliable. + */ + +/* + * KIMAGE_NO_DEST is an impossible destination address..., for + * allocating pages whose destination address we do not care about. + */ +#define KIMAGE_NO_DEST (-1UL) + +/* + * Offset of the last entry in an indirection page. 
+ */ +#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1) + + +static int kimage_is_destination_range(struct kexec_image *image, + paddr_t start, paddr_t end); +static struct page_info *kimage_alloc_page(struct kexec_image *image, + paddr_t dest); + +static struct page_info *kimage_alloc_zeroed_page(unsigned memflags) +{ + struct page_info *page; + + page = alloc_domheap_page(NULL, memflags); + if ( page == NULL ) + return NULL; + + clear_domain_page(page_to_mfn(page)); + + return page; +} + +static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry, + unsigned long nr_segments, + xen_kexec_segment_t *segments, uint8_t type) +{ + struct kexec_image *image; + unsigned long i; + int result; + + /* Allocate a controlling structure */ + result = -ENOMEM; + image = xzalloc(typeof(*image)); + if ( !image ) + goto out; + + image->control_page = ~0; /* By default this does not apply */ + image->entry_maddr = entry; + image->type = type; + image->nr_segments = nr_segments; + image->segments = segments; + + INIT_PAGE_LIST_HEAD(&image->control_pages); + INIT_PAGE_LIST_HEAD(&image->dest_pages); + INIT_PAGE_LIST_HEAD(&image->unusable_pages); + + /* + * Verify we have good destination addresses. The caller is + * responsible for making certain we don''t attempt to load + * the new image into invalid or reserved areas of RAM. This + * just verifies it is an address we can use. + * + * Since the kernel does everything in page size chunks ensure + * the destination addresses are page aligned. Too many + * special cases crop of when we don''t do this. The most + * insidious is getting overlapping destination addresses + * simply because addresses are changed to page size + * granularity. + */ + result = -EADDRNOTAVAIL; + for ( i = 0; i < nr_segments; i++ ) + { + paddr_t mstart, mend; + + mstart = image->segments[i].dest_maddr; + mend = mstart + image->segments[i].dest_size; + if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) ) + goto out; + } + + /* Verify our destination addresses do not overlap. + * If we allowed overlapping destination addresses + * through very weird things can happen with no + * easy explanation as one segment stops on another. + */ + result = -EINVAL; + for ( i = 0; i < nr_segments; i++ ) + { + paddr_t mstart, mend; + unsigned long j; + + mstart = image->segments[i].dest_maddr; + mend = mstart + image->segments[i].dest_size; + for (j = 0; j < i; j++ ) + { + paddr_t pstart, pend; + pstart = image->segments[j].dest_maddr; + pend = pstart + image->segments[j].dest_size; + /* Do the segments overlap ? */ + if ( (mend > pstart) && (mstart < pend) ) + goto out; + } + } + + /* Ensure our buffer sizes are strictly less than + * our memory sizes. This should always be the case, + * and it is easier to check up front than to be surprised + * later on. + */ + result = -EINVAL; + for ( i = 0; i < nr_segments; i++ ) + { + if ( image->segments[i].buf_size > image->segments[i].dest_size ) + goto out; + } + + /* Page for the relocation code must still be accessible after the + processor has switched to 32-bit mode. */ + result = -ENOMEM; + image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32)); + if ( !image->control_code_page ) + goto out; + + /* Add an empty indirection page. 
*/ + image->entry_page = kimage_alloc_control_page(image, 0); + if ( !image->entry_page ) + goto out; + + image->head = page_to_maddr(image->entry_page); + image->next_entry = 0; + + result = 0; +out: + if ( result == 0 ) + *rimage = image; + else + kimage_free(image); + + return result; + +} + +static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry, + unsigned long nr_segments, + xen_kexec_segment_t *segments) +{ + return do_kimage_alloc(rimage, entry, nr_segments, segments, + KEXEC_TYPE_DEFAULT); +} + +static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry, + unsigned long nr_segments, + xen_kexec_segment_t *segments) +{ + unsigned long i; + int result; + + /* Verify we have a valid entry point */ + if ( (entry < kexec_crash_area.start) + || (entry > kexec_crash_area.start + kexec_crash_area.size)) + return -EADDRNOTAVAIL; + + /* + * Verify we have good destination addresses. Normally + * the caller is responsible for making certain we don''t + * attempt to load the new image into invalid or reserved + * areas of RAM. But crash kernels are preloaded into a + * reserved area of ram. We must ensure the addresses + * are in the reserved area otherwise preloading the + * kernel could corrupt things. + */ + for ( i = 0; i < nr_segments; i++ ) + { + paddr_t mstart, mend; + + mstart = segments[i].dest_maddr; + mend = mstart + segments[i].dest_size - 1; + /* Ensure we are within the crash kernel limits */ + if ( (mstart < kexec_crash_area.start ) + || (mend > kexec_crash_area.start + kexec_crash_area.size)) + return -EADDRNOTAVAIL; + } + + /* Allocate and initialize a controlling structure */ + result = do_kimage_alloc(rimage, entry, nr_segments, segments, + KEXEC_TYPE_CRASH); + if ( result ) + return result; + + /* Enable the special crash kernel control page allocation + policy. */ + (*rimage)->control_page = kexec_crash_area.start; + + return 0; +} + +static int kimage_is_destination_range(struct kexec_image *image, + paddr_t start, + paddr_t end) +{ + unsigned long i; + + for ( i = 0; i < image->nr_segments; i++ ) + { + paddr_t mstart, mend; + + mstart = image->segments[i].dest_maddr; + mend = mstart + image->segments[i].dest_size; + if ( (end > mstart) && (start < mend) ) + return 1; + } + + return 0; +} + +static void kimage_free_page_list(struct page_list_head *list) +{ + struct page_info *page, *next; + + page_list_for_each_safe(page, next, list) + { + page_list_del(page, list); + free_domheap_page(page); + } +} + +static struct page_info *kimage_alloc_normal_control_page( + struct kexec_image *image, unsigned memflags) +{ + /* Control pages are special, they are the intermediaries + * that are needed while we copy the rest of the pages + * to their final resting place. As such they must + * not conflict with either the destination addresses + * or memory the kernel is already using. + * + * The only case where we really need more than one of + * these are for architectures where we cannot disable + * the MMU and must instead generate an identity mapped + * page table for all of the memory. + * + * At worst this runs in O(N) of the image size. + */ + struct page_list_head extra_pages; + struct page_info *page = NULL; + + INIT_PAGE_LIST_HEAD(&extra_pages); + + /* Loop while I can allocate a page and the page allocated + * is a destination page. 
+ */ + do { + unsigned long mfn, emfn; + paddr_t addr, eaddr; + + page = kimage_alloc_zeroed_page(memflags); + if ( !page ) + break; + mfn = page_to_mfn(page); + emfn = mfn + 1; + addr = page_to_maddr(page); + eaddr = addr + PAGE_SIZE; + if ( kimage_is_destination_range(image, addr, eaddr) ) + { + page_list_add(page, &extra_pages); + page = NULL; + } + } while ( !page ); + + if ( page ) + { + /* Remember the allocated page... */ + page_list_add(page, &image->control_pages); + + /* Because the page is already in it''s destination + * location we will never allocate another page at + * that address. Therefore kimage_alloc_page + * will not return it (again) and we don''t need + * to give it an entry in image->segments[]. + */ + } + /* Deal with the destination pages I have inadvertently allocated. + * + * Ideally I would convert multi-page allocations into single + * page allocations, and add everything to image->dest_pages. + * + * For now it is simpler to just free the pages. + */ + kimage_free_page_list(&extra_pages); + + return page; +} + +static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image) +{ + /* Control pages are special, they are the intermediaries + * that are needed while we copy the rest of the pages + * to their final resting place. As such they must + * not conflict with either the destination addresses + * or memory the kernel is already using. + * + * Control pages are also the only pags we must allocate + * when loading a crash kernel. All of the other pages + * are specified by the segments and we just memcpy + * into them directly. + * + * The only case where we really need more than one of + * these are for architectures where we cannot disable + * the MMU and must instead generate an identity mapped + * page table for all of the memory. + * + * Given the low demand this implements a very simple + * allocator that finds the first hole of the appropriate + * size in the reserved memory region, and allocates all + * of the memory up to and including the hole. + */ + paddr_t hole_start, hole_end, size; + struct page_info *page; + + page = NULL; + size = PAGE_SIZE; + hole_start = (image->control_page + (size - 1)) & ~(size - 1); + hole_end = hole_start + size - 1; + while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size ) + { + unsigned long i; + + if ( hole_end > kexec_crash_area.start + kexec_crash_area.size ) + break; + /* See if I overlap any of the segments */ + for ( i = 0; i < image->nr_segments; i++ ) + { + paddr_t mstart, mend; + + mstart = image->segments[i].dest_maddr; + mend = mstart + image->segments[i].dest_size - 1; + if ( (hole_end >= mstart) && (hole_start <= mend) ) + { + /* Advance the hole to the end of the segment */ + hole_start = (mend + (size - 1)) & ~(size - 1); + hole_end = hole_start + size - 1; + break; + } + } + /* If I don''t overlap any segments I have found my hole! 
*/ + if ( i == image->nr_segments ) + { + page = maddr_to_page(hole_start); + break; + } + } + if ( page ) + { + image->control_page = hole_end; + clear_domain_page(page_to_mfn(page)); + } + + return page; +} + + +struct page_info *kimage_alloc_control_page(struct kexec_image *image, + unsigned memflags) +{ + struct page_info *pages = NULL; + + switch ( image->type ) + { + case KEXEC_TYPE_DEFAULT: + pages = kimage_alloc_normal_control_page(image, memflags); + break; + case KEXEC_TYPE_CRASH: + pages = kimage_alloc_crash_control_page(image); + break; + } + return pages; +} + +static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry) +{ + kimage_entry_t *entries; + + if ( image->next_entry == KIMAGE_LAST_ENTRY ) + { + struct page_info *page; + + page = kimage_alloc_page(image, KIMAGE_NO_DEST); + if ( !page ) + return -ENOMEM; + + entries = __map_domain_page(image->entry_page); + entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION; + unmap_domain_page(entries); + + image->entry_page = page; + image->next_entry = 0; + } + + entries = __map_domain_page(image->entry_page); + entries[image->next_entry] = entry; + image->next_entry++; + unmap_domain_page(entries); + + + return 0; +} + +static int kimage_set_destination(struct kexec_image *image, + paddr_t destination) +{ + return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION); +} + + +static int kimage_add_page(struct kexec_image *image, paddr_t maddr) +{ + return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE); +} + + +static void kimage_free_extra_pages(struct kexec_image *image) +{ + kimage_free_page_list(&image->dest_pages); + kimage_free_page_list(&image->unusable_pages); + +} + +static void kimage_terminate(struct kexec_image *image) +{ + kimage_entry_t *entries; + + entries = __map_domain_page(image->entry_page); + entries[image->next_entry] = IND_DONE; + unmap_domain_page(entries); +} + +/* + * Iterate over all the entries in the indirection pages. + * + * Call unmap_domain_page(ptr) after the loop exits. + */ +#define for_each_kimage_entry(image, ptr, entry) \ + for ( ptr = map_domain_page(image->head >> PAGE_SHIFT); \ + (entry = *ptr) && !(entry & IND_DONE); \ + ptr = (entry & IND_INDIRECTION) ? \ + (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \ + : ptr + 1 ) + +static void kimage_free_entry(kimage_entry_t entry) +{ + struct page_info *page; + + page = mfn_to_page(entry >> PAGE_SHIFT); + free_domheap_page(page); +} + +void kimage_free(struct kexec_image *image) +{ + kimage_entry_t *ptr, entry; + kimage_entry_t ind = 0; + + if ( !image ) + return; + + kimage_free_extra_pages(image); + for_each_kimage_entry(image, ptr, entry) + { + if ( entry & IND_INDIRECTION ) + { + /* Free the previous indirection page */ + if ( ind & IND_INDIRECTION ) + kimage_free_entry(ind); + /* Save this indirection page until we are + * done with it. + */ + ind = entry; + } + else if ( entry & IND_SOURCE ) + kimage_free_entry(entry); + } + unmap_domain_page(ptr); + + /* Free the final indirection page */ + if ( ind & IND_INDIRECTION ) + kimage_free_entry(ind); + + /* Free the kexec control pages... 
*/ + kimage_free_page_list(&image->control_pages); + xfree(image->segments); + xfree(image); +} + +static kimage_entry_t *kimage_dst_used(struct kexec_image *image, + paddr_t maddr) +{ + kimage_entry_t *ptr, entry; + unsigned long destination = 0; + + for_each_kimage_entry(image, ptr, entry) + { + if ( entry & IND_DESTINATION ) + destination = entry & PAGE_MASK; + else if ( entry & IND_SOURCE ) + { + if ( maddr == destination ) + return ptr; + destination += PAGE_SIZE; + } + } + unmap_domain_page(ptr); + + return NULL; +} + +static struct page_info *kimage_alloc_page(struct kexec_image *image, + paddr_t destination) +{ + /* + * Here we implement safeguards to ensure that a source page + * is not copied to its destination page before the data on + * the destination page is no longer useful. + * + * To do this we maintain the invariant that a source page is + * either its own destination page, or it is not a + * destination page at all. + * + * That is slightly stronger than required, but the proof + * that no problems will not occur is trivial, and the + * implementation is simply to verify. + * + * When allocating all pages normally this algorithm will run + * in O(N) time, but in the worst case it will run in O(N^2) + * time. If the runtime is a problem the data structures can + * be fixed. + */ + struct page_info *page; + paddr_t addr; + + /* + * Walk through the list of destination pages, and see if I + * have a match. + */ + page_list_for_each(page, &image->dest_pages) + { + addr = page_to_maddr(page); + if ( addr == destination ) + { + page_list_del(page, &image->dest_pages); + return page; + } + } + page = NULL; + for (;;) + { + kimage_entry_t *old; + + /* Allocate a page, if we run out of memory give up */ + page = kimage_alloc_zeroed_page(0); + if ( !page ) + return NULL; + addr = page_to_maddr(page); + + /* If it is the destination page we want use it */ + if ( addr == destination ) + break; + + /* If the page is not a destination page use it */ + if ( !kimage_is_destination_range(image, addr, + addr + PAGE_SIZE) ) + break; + + /* + * I know that the page is someones destination page. + * See if there is already a source page for this + * destination page. And if so swap the source pages. + */ + old = kimage_dst_used(image, addr); + if ( old ) + { + /* If so move it */ + unsigned long old_mfn = *old >> PAGE_SHIFT; + unsigned long mfn = addr >> PAGE_SHIFT; + + copy_domain_page(mfn, old_mfn); + clear_domain_page(old_mfn); + *old = (addr & ~PAGE_MASK) | IND_SOURCE; + unmap_domain_page(old); + + page = mfn_to_page(old_mfn); + break; + } + else + { + /* Place the page on the destination list I + * will use it later. 
+ */ + page_list_add(page, &image->dest_pages); + } + } + + return page; +} + +static int kimage_load_normal_segment(struct kexec_image *image, + xen_kexec_segment_t *segment) +{ + unsigned long to_copy; + unsigned long src_offset; + paddr_t dest, end; + int ret; + + to_copy = segment->buf_size; + src_offset = 0; + dest = segment->dest_maddr; + + ret = kimage_set_destination(image, dest); + if ( ret < 0 ) + return ret; + + while ( to_copy ) + { + unsigned long dest_mfn; + struct page_info *page; + void *dest_va; + size_t size; + + dest_mfn = dest >> PAGE_SHIFT; + + size = min_t(unsigned long, PAGE_SIZE, to_copy); + + page = kimage_alloc_page(image, dest); + if ( !page ) + return -ENOMEM; + ret = kimage_add_page(image, page_to_maddr(page)); + if ( ret < 0 ) + return ret; + + dest_va = __map_domain_page(page); + ret = copy_from_guest_offset(dest_va, segment->buf, src_offset, size); + unmap_domain_page(dest_va); + if ( ret ) + return -EFAULT; + + to_copy -= size; + src_offset += size; + dest += PAGE_SIZE; + } + + /* Remainder of the destination should be zeroed. */ + end = segment->dest_maddr + segment->dest_size; + for ( ; dest < end; dest += PAGE_SIZE ) + kimage_add_entry(image, IND_ZERO); + + return 0; +} + +static int kimage_load_crash_segment(struct kexec_image *image, + xen_kexec_segment_t *segment) +{ + /* For crash dumps kernels we simply copy the data from + * user space to it''s destination. + */ + paddr_t dest; + unsigned long sbytes, dbytes; + int ret = 0; + unsigned long src_offset = 0; + + sbytes = segment->buf_size; + dbytes = segment->dest_size; + dest = segment->dest_maddr; + + while ( dbytes ) + { + unsigned long dest_mfn; + void *dest_va; + size_t schunk, dchunk; + + dest_mfn = dest >> PAGE_SHIFT; + + dchunk = PAGE_SIZE; + schunk = min(dchunk, sbytes); + + dest_va = map_domain_page(dest_mfn); + if ( dest_va == NULL ) + return -EINVAL; + + ret = copy_from_guest_offset(dest_va, segment->buf, src_offset, schunk); + memset(dest_va + schunk, 0, dchunk - schunk); + + unmap_domain_page(dest_va); + if ( ret ) + return -EFAULT; + + dbytes -= dchunk; + sbytes -= schunk; + dest += dchunk; + src_offset += schunk; + } + + return 0; +} + +static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment) +{ + int result = -ENOMEM; + + switch ( image->type ) + { + case KEXEC_TYPE_DEFAULT: + result = kimage_load_normal_segment(image, segment); + break; + case KEXEC_TYPE_CRASH: + result = kimage_load_crash_segment(image, segment); + break; + } + + return result; +} + +int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch, + uint64_t entry_maddr, + uint32_t nr_segments, xen_kexec_segment_t *segment) +{ + int result; + + switch( type ) + { + case KEXEC_TYPE_DEFAULT: + result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment); + break; + case KEXEC_TYPE_CRASH: + result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment); + break; + default: + result = -EINVAL; + break; + } + if ( result < 0 ) + return result; + + (*rimage)->arch = arch; + + return result; +} + +int kimage_load_segments(struct kexec_image *image) +{ + int s; + int result; + + for ( s = 0; s < image->nr_segments; s++ ) { + result = kimage_load_segment(image, &image->segments[s]); + if ( result < 0 ) + return result; + } + kimage_terminate(image); + return 0; +} + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/include/xen/kimage.h 
b/xen/include/xen/kimage.h new file mode 100644 index 0000000..9555688 --- /dev/null +++ b/xen/include/xen/kimage.h @@ -0,0 +1,59 @@ +#ifndef __XEN_KIMAGE_H__ +#define __XEN_KIMAGE_H__ + +#include <xen/list.h> +#include <xen/mm.h> +#include <public/kexec.h> + +#define KEXEC_CONTROL_PAGE_SIZE PAGE_SIZE + +#define KEXEC_SEGMENT_MAX 16 + +typedef paddr_t kimage_entry_t; +#define IND_DESTINATION 0x1 +#define IND_INDIRECTION 0x2 +#define IND_DONE 0x4 +#define IND_SOURCE 0x8 +#define IND_ZERO 0x10 + +struct kexec_image { + uint8_t type; + uint16_t arch; + uint64_t entry_maddr; + uint32_t nr_segments; + xen_kexec_segment_t *segments; + + kimage_entry_t head; + struct page_info *entry_page; + unsigned next_entry; + + struct page_info *control_code_page; + struct page_info *aux_page; + + struct page_list_head control_pages; + struct page_list_head dest_pages; + struct page_list_head unusable_pages; + + /* Address of next control page to allocate for crash kernels. */ + paddr_t control_page; +}; + +int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch, + uint64_t entry_maddr, + uint32_t nr_segments, xen_kexec_segment_t *segment); +void kimage_free(struct kexec_image *image); +int kimage_load_segments(struct kexec_image *image); +struct page_info *kimage_alloc_control_page(struct kexec_image *image, + unsigned memflags); + +#endif /* __XEN_KIMAGE_H__ */ + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ -- 1.7.2.5
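To make the indirection-page encoding concrete: loading a single KEXEC_TYPE_DEFAULT segment with dest_maddr 0x200000, a dest_size of four pages and a buf_size of a little over two pages produces an entry stream along these lines (the source page addresses are made up):

    0x200000 | IND_DESTINATION   set the destination pointer
    <page A> | IND_SOURCE        copy domheap page A to 0x200000
    <page B> | IND_SOURCE        copy domheap page B to 0x201000
    <page C> | IND_SOURCE        copy page C (its tail already zero) to 0x202000
               IND_ZERO          zero the destination page at 0x203000
               IND_DONE          appended by kimage_terminate()

Whenever the current entry page fills up, kimage_add_entry() allocates a fresh one and chains to it with an IND_INDIRECTION entry.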
David Vrabel
2013-Jun-24 17:42 UTC
[PATCH 05/10] kexec: extend hypercall with improved load/unload ops
From: David Vrabel <david.vrabel@citrix.com> In the existing kexec hypercall, the load and unload ops depend on internals of the Linux kernel (the page list and code page provided by the kernel). The code page is used to transition between Xen context and the image so using kernel code doesn''t make sense and will not work for PVH guests. Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops that no longer require a code page to be provided by the guest -- Xen now provides the code for calling the image directly. The new load op looks similar to the Linux kexec_load system call and allows the guest to provide the image data to be loaded. The guest specifies the architecture of the image which may be a 32-bit subarch of the hypervisor''s architecture (i.e., an EM_386 image on an EM_X86_64 hypervisor). The toolstack can now load images without kernel involvement. This is required for supporting kexec when using a dom0 with an upstream kernel. Crash images are copied directly into the crash region on load. Default images are copied into domheap pages and a list of source and destination machine addresses is created. This is list is used in kexec_reloc() to relocate the image to its destination. The old load and unload sub-ops are still available (as KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top of the new infrastructure. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- xen/arch/x86/machine_kexec.c | 282 +++++++++++++++++++------ xen/arch/x86/x86_64/Makefile | 2 +- xen/arch/x86/x86_64/compat_kexec.S | 187 ----------------- xen/arch/x86/x86_64/kexec_reloc.S | 211 +++++++++++++++++++ xen/common/kexec.c | 393 +++++++++++++++++++++++++++++------ xen/common/kimage.c | 103 +++++++++ xen/include/asm-x86/fixmap.h | 3 - xen/include/asm-x86/machine_kexec.h | 16 ++ xen/include/xen/kexec.h | 14 +- xen/include/xen/kimage.h | 6 + 10 files changed, 885 insertions(+), 332 deletions(-) delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S create mode 100644 xen/include/asm-x86/machine_kexec.h diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c index 68b9705..d6a1082 100644 --- a/xen/arch/x86/machine_kexec.c +++ b/xen/arch/x86/machine_kexec.c @@ -1,9 +1,18 @@ /****************************************************************************** * machine_kexec.c * + * Copyright (C) 2013 Citrix Systems R&D Ltd. + * + * Portions derived from Linux''s arch/x86/kernel/machine_kexec_64.c. + * + * Copyright (C) 2002-2005 Eric Biederman <ebiederm@xmission.com> + * * Xen port written by: * - Simon ''Horms'' Horman <horms@verge.net.au> * - Magnus Damm <magnus@valinux.co.jp> + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. 
*/ #include <xen/types.h> @@ -11,63 +20,216 @@ #include <xen/guest_access.h> #include <asm/fixmap.h> #include <asm/hpet.h> +#include <asm/page.h> +#include <asm/machine_kexec.h> + +static void init_level2_page(l2_pgentry_t *l2, unsigned long addr) +{ + unsigned long end_addr; + + addr &= PAGE_MASK; + end_addr = addr + L2_PAGETABLE_ENTRIES * (1ul << L2_PAGETABLE_SHIFT); -typedef void (*relocate_new_kernel_t)( - unsigned long indirection_page, - unsigned long *page_list, - unsigned long start_address, - unsigned int preserve_context); + while ( addr < end_addr ) + { + l2e_write(l2++, l2e_from_paddr(addr, __PAGE_HYPERVISOR | _PAGE_PSE)); -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image) + addr += 1ul << L2_PAGETABLE_SHIFT; + } +} + +static int init_level3_page(struct kexec_image *image, l3_pgentry_t *l3, + unsigned long addr, unsigned long last_addr) { - unsigned long prev_ma = 0; - int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1)); - int k; + unsigned long end_addr; - /* setup fixmap to point to our pages and record the virtual address - * in every odd index in page_list[]. - */ + addr &= PAGE_MASK; + end_addr = addr + L3_PAGETABLE_ENTRIES * (1ul << L3_PAGETABLE_SHIFT); + + while( (addr < last_addr) && (addr < end_addr) ) + { + struct page_info *l2_page; + l2_pgentry_t *l2; + + l2_page = kimage_alloc_control_page(image, 0); + if ( !l2_page ) + return -ENOMEM; + l2 = __map_domain_page(l2_page); + init_level2_page(l2, addr); + unmap_domain_page(l2); + + l3e_write(l3++, l3e_from_page(l2_page, __PAGE_HYPERVISOR)); + + addr += 1ul << L3_PAGETABLE_SHIFT; + } + + return 0; +} + +/* + * Build a complete page table to identity map [addr, last_addr). + * + * Control pages are used so they do not overlap with the image source + * or destination. + */ +static int init_level4_page(struct kexec_image *image, l4_pgentry_t *l4, + unsigned long addr, unsigned long last_addr) +{ + unsigned long end_addr; + int result; + + addr &= PAGE_MASK; + end_addr = addr + L4_PAGETABLE_ENTRIES * (1ul << L4_PAGETABLE_SHIFT); + + while ( (addr < last_addr) && (addr < end_addr) ) + { + struct page_info *l3_page; + l3_pgentry_t *l3; + + l3_page = kimage_alloc_control_page(image, 0); + if ( !l3_page ) + return -ENOMEM; + l3 = __map_domain_page(l3_page); + result = init_level3_page(image, l3, addr, last_addr); + unmap_domain_page(l3); + if (result) + return result; + + l4e_write(l4++, l4e_from_page(l3_page, __PAGE_HYPERVISOR)); + + addr += 1ul << L4_PAGETABLE_SHIFT; + } + + return 0; +} + +/* + * Add a mapping for the control code page to the same virtual address + * as kexec_reloc. This allows us to keep running after these page + * tables are loaded in kexec_reloc. + * + * We don''t really need to allocate control pages here as these + * entries won''t be used while the kexec image is being copied, but it + * makes clean-up easier. 
+ */ +static int init_transition_pgtable(struct kexec_image *image, l4_pgentry_t *l4) +{ + struct page_info *l3_page; + struct page_info *l2_page; + struct page_info *l1_page; + unsigned long vaddr, paddr; + l3_pgentry_t *l3 = NULL; + l2_pgentry_t *l2 = NULL; + l1_pgentry_t *l1 = NULL; + int ret = -ENOMEM; + + vaddr = (unsigned long)kexec_reloc; + paddr = page_to_maddr(image->control_code_page); + + l4 += l4_table_offset(vaddr); + if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) ) + { + l3_page = kimage_alloc_control_page(image, 0); + if ( !l3_page ) + goto out; + l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR)); + } + else + l3_page = l4e_get_page(*l4); + + l3 = __map_domain_page(l3_page); + l3 += l3_table_offset(vaddr); + if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) ) + { + l2_page = kimage_alloc_control_page(image, 0); + if ( !l2_page ) + goto out; + l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR)); + } + else + l2_page = l3e_get_page(*l3); + + l2 = __map_domain_page(l2_page); + l2 += l2_table_offset(vaddr); + if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) ) + { + l1_page = kimage_alloc_control_page(image, 0); + if ( !l1_page ) + goto out; + l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR)); + } + else + l1_page = l2e_get_page(*l2); + + l1 = __map_domain_page(l1_page); + l1 += l1_table_offset(vaddr); + l1e_write(l1, l1e_from_pfn(paddr >> PAGE_SHIFT, __PAGE_HYPERVISOR)); + + ret = 0; +out: + if ( l1 ) + unmap_domain_page(l1); + if ( l2 ) + unmap_domain_page(l2); + if ( l3 ) + unmap_domain_page(l3); + return ret; +} - for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ ) + +static int build_reloc_page_table(struct kexec_image *image) +{ + struct page_info *l4_page; + l4_pgentry_t *l4; + int result; + + l4_page = kimage_alloc_control_page(image, 0); + if ( !l4_page ) + return -ENOMEM; + l4 = __map_domain_page(l4_page); + + result = init_level4_page(image, l4, 0, max_page << PAGE_SHIFT); + if ( result == 0) + result = init_transition_pgtable(image, l4); + unmap_domain_page(l4); + if ( result ) + return result; + + image->aux_page = l4_page; + return 0; +} + +int machine_kexec_load(struct kexec_image *image) +{ + void *code_page; + int ret; + + switch ( image->arch ) { - if ( (k & 1) == 0 ) - { - /* Even pages: machine address. */ - prev_ma = image->page_list[k]; - } - else - { - /* Odd pages: va for previous ma. */ - if ( is_pv_32on64_domain(dom0) ) - { - /* - * The compatability bounce code sets up a page table - * with a 1-1 mapping of the first 1G of memory so - * VA==PA here. - * - * This Linux purgatory code still sets up separate - * high and low mappings on the control page (entries - * 0 and 1) but it is harmless if they are equal since - * that PT is not live at the time. - */ - image->page_list[k] = prev_ma; - } - else - { - set_fixmap(fix_base + (k >> 1), prev_ma); - image->page_list[k] = fix_to_virt(fix_base + (k >> 1)); - } - } + case EM_386: + case EM_X86_64: + break; + default: + return -EINVAL; } + code_page = __map_domain_page(image->control_code_page); + memcpy(code_page, kexec_reloc, kexec_reloc_size); + unmap_domain_page(code_page); + + ret = build_reloc_page_table(image); + if ( ret < 0 ) + return ret; + return 0; } -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image) +void machine_kexec_unload(struct kexec_image *image) { + /* no-op. kimage_free() frees all control pages. 
*/ } -void machine_reboot_kexec(xen_kexec_image_t *image) +void machine_reboot_kexec(struct kexec_image *image) { BUG_ON(smp_processor_id() != 0); smp_send_stop(); @@ -75,13 +237,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image) BUG(); } -void machine_kexec(xen_kexec_image_t *image) +void machine_kexec(struct kexec_image *image) { - struct desc_ptr gdt_desc = { - .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY), - .limit = LAST_RESERVED_GDT_BYTE - }; int i; + unsigned long reloc_flags = 0; /* We are about to permenantly jump out of the Xen context into the kexec * purgatory code. We really dont want to be still servicing interupts. @@ -109,29 +268,12 @@ void machine_kexec(xen_kexec_image_t *image) * not like running with NMIs disabled. */ enable_nmis(); - /* - * compat_machine_kexec() returns to idle pagetables, which requires us - * to be running on a static GDT mapping (idle pagetables have no GDT - * mappings in their per-domain mapping area). - */ - asm volatile ( "lgdt %0" : : "m" (gdt_desc) ); + if ( image->arch == EM_386 ) + reloc_flags |= KEXEC_RELOC_FLAG_COMPAT; - if ( is_pv_32on64_domain(dom0) ) - { - compat_machine_kexec(image->page_list[1], - image->indirection_page, - image->page_list, - image->start_address); - } - else - { - relocate_new_kernel_t rnk; - - rnk = (relocate_new_kernel_t) image->page_list[1]; - (*rnk)(image->indirection_page, image->page_list, - image->start_address, - 0 /* preserve_context */); - } + kexec_reloc(page_to_maddr(image->control_code_page), + page_to_maddr(image->aux_page), + image->head, image->entry_maddr, reloc_flags); } int machine_kexec_get(xen_kexec_range_t *range) diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile index d56e12d..7f8fb3d 100644 --- a/xen/arch/x86/x86_64/Makefile +++ b/xen/arch/x86/x86_64/Makefile @@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o obj-y += mmconfig_64.o obj-y += mmconfig-shared.o obj-y += compat.o -obj-bin-y += compat_kexec.o obj-y += domain.o obj-y += physdev.o obj-y += platform_hypercall.o obj-y += cpu_idle.o obj-y += cpufreq.o +obj-bin-y += kexec_reloc.o obj-$(crash_debug) += gdbstub.o diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S deleted file mode 100644 index fc92af9..0000000 --- a/xen/arch/x86/x86_64/compat_kexec.S +++ /dev/null @@ -1,187 +0,0 @@ -/* - * Compatibility kexec handler. - */ - -/* - * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is - * currently true but if it ever changes then compat_pg_table will - * need to be moved back below 4G at run time. - */ - -#include <xen/config.h> - -#include <asm/asm_defns.h> -#include <asm/msr.h> -#include <asm/page.h> - -/* The unrelocated physical address of a symbol. */ -#define SYM_PHYS(sym) ((sym) - __XEN_VIRT_START) - -/* Load physical address of symbol into register and relocate it. */ -#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \ - add xen_phys_start(%rip), reg - -/* - * Relocate a physical address in memory. Size of temporary register - * determines size of the value to relocate. - */ -#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \ - add xen_phys_start(%rip), reg ; \ - mov reg, addr(%rip) - - .text - - .code64 - -ENTRY(compat_machine_kexec) - /* x86/64 x86/32 */ - /* %rdi - relocate_new_kernel_t CALL */ - /* %rsi - indirection page 4(%esp) */ - /* %rdx - page_list 8(%esp) */ - /* %rcx - start address 12(%esp) */ - /* cpu has pae 16(%esp) */ - - /* Shim the 64 bit page_list into a 32 bit page_list. 
*/ - mov $12,%r9 - lea compat_page_list(%rip), %rbx -1: dec %r9 - movl (%rdx,%r9,8),%eax - movl %eax,(%rbx,%r9,4) - test %r9,%r9 - jnz 1b - - RELOCATE_SYM(compat_page_list,%rdx) - - /* Relocate compatibility mode entry point address. */ - RELOCATE_MEM(compatibility_mode_far,%eax) - - /* Relocate compat_pg_table. */ - RELOCATE_MEM(compat_pg_table, %rax) - RELOCATE_MEM(compat_pg_table+0x8, %rax) - RELOCATE_MEM(compat_pg_table+0x10,%rax) - RELOCATE_MEM(compat_pg_table+0x18,%rax) - - /* - * Setup an identity mapped region in PML4[0] of idle page - * table. - */ - RELOCATE_SYM(l3_identmap,%rax) - or $0x63,%rax - mov %rax, idle_pg_table(%rip) - - /* Switch to idle page table. */ - RELOCATE_SYM(idle_pg_table,%rax) - movq %rax, %cr3 - - /* Switch to identity mapped compatibility stack. */ - RELOCATE_SYM(compat_stack,%rax) - movq %rax, %rsp - - /* Save xen_phys_start for 32 bit code. */ - movq xen_phys_start(%rip), %rbx - - /* Jump to low identity mapping in compatibility mode. */ - ljmp *compatibility_mode_far(%rip) - ud2 - -compatibility_mode_far: - .long SYM_PHYS(compatibility_mode) - .long __HYPERVISOR_CS32 - - /* - * We use 5 words of stack for the arguments passed to the kernel. The - * kernel only uses 1 word before switching to its own stack. Allocate - * 16 words to give "plenty" of room. - */ - .fill 16,4,0 -compat_stack: - - .code32 - -#undef RELOCATE_SYM -#undef RELOCATE_MEM - -/* - * Load physical address of symbol into register and relocate it. %rbx - * contains xen_phys_start(%rip) saved before jump to compatibility - * mode. - */ -#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \ - add %ebx, reg - -compatibility_mode: - /* Setup some sane segments. */ - movl $__HYPERVISOR_DS32, %eax - movl %eax, %ds - movl %eax, %es - movl %eax, %fs - movl %eax, %gs - movl %eax, %ss - - /* Push arguments onto stack. */ - pushl $0 /* 20(%esp) - preserve context */ - pushl $1 /* 16(%esp) - cpu has pae */ - pushl %ecx /* 12(%esp) - start address */ - pushl %edx /* 8(%esp) - page list */ - pushl %esi /* 4(%esp) - indirection page */ - pushl %edi /* 0(%esp) - CALL */ - - /* Disable paging and therefore leave 64 bit mode. */ - movl %cr0, %eax - andl $~X86_CR0_PG, %eax - movl %eax, %cr0 - - /* Switch to 32 bit page table. */ - RELOCATE_SYM(compat_pg_table, %eax) - movl %eax, %cr3 - - /* Clear MSR_EFER[LME], disabling long mode */ - movl $MSR_EFER,%ecx - rdmsr - btcl $_EFER_LME,%eax - wrmsr - - /* Re-enable paging, but only 32 bit mode now. */ - movl %cr0, %eax - orl $X86_CR0_PG, %eax - movl %eax, %cr0 - jmp 1f -1: - - popl %eax - call *%eax - ud2 - - .data - .align 4 -compat_page_list: - .fill 12,4,0 - - .align 32,0 - - /* - * These compat page tables contain an identity mapping of the - * first 4G of the physical address space. 
- */ -compat_pg_table: - .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0 - .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0 - .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0 - .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0 - - .section .data.page_aligned, "aw", @progbits - .align PAGE_SIZE,0 -compat_pg_table_l2: - .macro identmap from=0, count=512 - .if \count-1 - identmap "(\from+0)","(\count/2)" - identmap "(\from+(0x200000*(\count/2)))","(\count/2)" - .else - .quad 0x00000000000000e3 + \from - .endif - .endm - - identmap 0x00000000 - identmap 0x40000000 - identmap 0x80000000 - identmap 0xc0000000 diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S new file mode 100644 index 0000000..135cbcd --- /dev/null +++ b/xen/arch/x86/x86_64/kexec_reloc.S @@ -0,0 +1,211 @@ +/* + * Relocate a kexec_image to its destination and call it. + * + * Copyright (C) 2013 Citrix Systems R&D Ltd. + * + * Portions derived from Linux''s arch/x86/kernel/relocate_kernel_64.S. + * + * Copyright (C) 2002-2005 Eric Biederman <ebiederm@xmission.com> + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. + */ +#include <xen/config.h> + +#include <asm/asm_defns.h> +#include <asm/msr.h> +#include <asm/page.h> +#include <asm/machine_kexec.h> + + .text + .align PAGE_SIZE + .code64 + +ENTRY(kexec_reloc) + /* %rdi - code page maddr */ + /* %rsi - page table maddr */ + /* %rdx - indirection page maddr */ + /* %rcx - entry maddr */ + /* %r8 - flags */ + + movq %rdx, %rbx + + /* Setup stack. */ + leaq (reloc_stack - kexec_reloc)(%rdi), %rsp + + /* Load reloc page table. */ + movq %rsi, %cr3 + + /* Jump to identity mapped code. */ + leaq (identity_mapped - kexec_reloc)(%rdi), %rax + jmpq *%rax + +identity_mapped: + pushq %rcx + pushq %rbx + pushq %rsi + pushq %rdi + + /* + * Set cr0 to a known state: + * - Paging enabled + * - Alignment check disabled + * - Write protect disabled + * - No task switch + * - Don''t do FP software emulation. + * - Proctected mode enabled + */ + movq %cr0, %rax + andq $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %rax + orl $(X86_CR0_PG | X86_CR0_PE), %eax + movq %rax, %cr0 + + /* + * Set cr4 to a known state: + * - physical address extension enabled + */ + movq $X86_CR4_PAE, %rax + movq %rax, %cr4 + + movq %rbx, %rdi + call relocate_pages + + popq %rdi + popq %rsi + popq %rbx + popq %rcx + + /* Need to switch to 32-bit mode? */ + testq $KEXEC_RELOC_FLAG_COMPAT, %r8 + jnz call_32_bit + +call_64_bit: + /* Call the image entry point. This should never return. */ + call *%rcx + ud2 + +call_32_bit: + /* Setup IDT. */ + lidt compat_mode_idt(%rip) + + /* Load compat GDT. */ + leaq (compat_mode_gdt - kexec_reloc)(%rdi), %rax + movq %rax, (compat_mode_gdt_desc + 2)(%rip) + lgdt compat_mode_gdt_desc(%rip) + + /* Relocate compatibility mode entry point address. */ + leal (compatibility_mode - kexec_reloc)(%edi), %eax + movl %eax, compatibility_mode_far(%rip) + + /* Enter compatibility mode. */ + ljmp *compatibility_mode_far(%rip) + +relocate_pages: + /* %rdi - indirection page maddr */ + cld + movq %rdi, %rcx + xorq %rdi, %rdi + xorq %rsi, %rsi + jmp 1f + +0: /* top, read another word for the indirection page */ + + movq (%rbx), %rcx + addq $8, %rbx +1: + testq $0x1, %rcx /* is it a destination page? */ + jz 2f + movq %rcx, %rdi + andq $0xfffffffffffff000, %rdi + jmp 0b +2: + testq $0x2, %rcx /* is it an indirection page? 
*/ + jz 2f + movq %rcx, %rbx + andq $0xfffffffffffff000, %rbx + jmp 0b +2: + testq $0x4, %rcx /* is it the done indicator? */ + jz 2f + jmp 3f +2: + testq $0x8, %rcx /* is it the source indicator? */ + jz 2f + movq %rcx, %rsi /* For ever source page do a copy */ + andq $0xfffffffffffff000, %rsi + movq $512, %rcx + rep movsq + jmp 0b +2: + testq $0x10, %rcx /* is it the zero indicator? */ + jz 0b /* Ignore it otherwise */ + movq $512, %rcx /* Zero the destination page. */ + xorq %rax, %rax + rep stosq + jmp 0b +3: + ret + + .code32 + +compatibility_mode: + /* Setup some sane segments. */ + movl $0x0008, %eax + movl %eax, %ds + movl %eax, %es + movl %eax, %fs + movl %eax, %gs + movl %eax, %ss + + movl %ecx, %ebp + + /* Disable paging and therefore leave 64 bit mode. */ + movl %cr0, %eax + andl $~X86_CR0_PG, %eax + movl %eax, %cr0 + + /* Disable long mode */ + movl $MSR_EFER, %ecx + rdmsr + andl $~EFER_LME, %eax + wrmsr + + /* Clear cr4 to disable PAE. */ + movl $0, %eax + movl %eax, %cr4 + + /* Call the image entry point. This should never return. */ + call *%ebp + ud2 + + .align 16 +compatibility_mode_far: + .long 0x00000000 /* set in call_32_bit above */ + .word 0x0010 + + .align 16 +compat_mode_gdt_desc: + .word (3*8)-1 + .quad 0x0000000000000000 /* set in call_32_bit above */ + + .align 16 +compat_mode_gdt: + .quad 0x0000000000000000 /* null */ + .quad 0x00cf92000000ffff /* 0x0008 ring 0 data */ + .quad 0x00cf9a000000ffff /* 0x0010 ring 0 code, compatibility */ + +compat_mode_idt: + .word 0 /* limit */ + .long 0 /* base */ + + /* + * 16 words of stack are more than enough. + */ + .fill 16,8,0 +reloc_stack: + + .globl __kexec_reloc_size + .set __kexec_reloc_size, . - kexec_reloc + .globl kexec_reloc_size +kexec_reloc_size: + .quad __kexec_reloc_size diff --git a/xen/common/kexec.c b/xen/common/kexec.c index 2d34524..b180e0c 100644 --- a/xen/common/kexec.c +++ b/xen/common/kexec.c @@ -25,6 +25,7 @@ #include <xen/version.h> #include <xen/console.h> #include <xen/kexec.h> +#include <xen/kimage.h> #include <public/elfnote.h> #include <xsm/xsm.h> #include <xen/cpu.h> @@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note; static cpumask_t crash_saved_cpus; -static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR]; +static struct kexec_image *kexec_image[KEXEC_IMAGE_NR]; #define KEXEC_FLAG_DEFAULT_POS (KEXEC_IMAGE_NR + 0) #define KEXEC_FLAG_CRASH_POS (KEXEC_IMAGE_NR + 1) @@ -311,14 +312,14 @@ void kexec_crash(void) kexec_common_shutdown(); kexec_crash_save_cpu(); machine_crash_shutdown(); - machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]); + machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]); BUG(); } static long kexec_reboot(void *_image) { - xen_kexec_image_t *image = _image; + struct kexec_image *image = _image; kexecing = TRUE; @@ -734,63 +735,261 @@ static void crash_save_vmcoreinfo(void) #endif } -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load) +static void kexec_unload_image(struct kexec_image *image) +{ + if ( !image ) + return; + + machine_kexec_unload(image); + kimage_free(image); +} + +static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg) +{ + xen_kexec_exec_t exec; + struct kexec_image *image; + int base, bit, pos, ret = -EINVAL; + + if ( unlikely(copy_from_guest(&exec, uarg, 1)) ) + return -EFAULT; + + if ( kexec_load_get_bits(exec.type, &base, &bit) ) + return -EINVAL; + + pos = (test_bit(bit, &kexec_flags) != 0); + + /* Only allow kexec/kdump into loaded images */ + if ( !test_bit(base + pos, &kexec_flags) ) + return -ENOENT; + + switch 
(exec.type) + { + case KEXEC_TYPE_DEFAULT: + image = kexec_image[base + pos]; + ret = continue_hypercall_on_cpu(0, kexec_reboot, image); + break; + case KEXEC_TYPE_CRASH: + kexec_crash(); /* Does not return */ + break; + } + + return -EINVAL; /* never reached */ +} + +static int kexec_swap_images(int type, struct kexec_image *new, + struct kexec_image **old) { - xen_kexec_image_t *image; int base, bit, pos; - int ret = 0; + int new_slot, old_slot; + + *old = NULL; + + spin_lock(&kexec_lock); + + if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) ) + { + spin_unlock(&kexec_lock); + return -EBUSY; + } - if ( kexec_load_get_bits(load->type, &base, &bit) ) + if ( kexec_load_get_bits(type, &base, &bit) ) return -EINVAL; pos = (test_bit(bit, &kexec_flags) != 0); + old_slot = base + pos; + new_slot = base + !pos; - /* Load the user data into an unused image */ - if ( op == KEXEC_CMD_kexec_load ) + if ( new ) { - image = &kexec_image[base + !pos]; + kexec_image[new_slot] = new; + set_bit(new_slot, &kexec_flags); + } + change_bit(bit, &kexec_flags); - BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */ + clear_bit(old_slot, &kexec_flags); + *old = kexec_image[old_slot]; - memcpy(image, &load->image, sizeof(*image)); + spin_unlock(&kexec_lock); - if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) ) - { - /* Set image present bit */ - set_bit((base + !pos), &kexec_flags); + return 0; +} - /* Make new image the active one */ - change_bit(bit, &kexec_flags); - } +static int kexec_load_slot(struct kexec_image *kimage) +{ + struct kexec_image *old_kimage; + int ret = -ENOMEM; + + ret = machine_kexec_load(kimage); + if ( ret < 0 ) + return ret; + + crash_save_vmcoreinfo(); + + ret = kexec_swap_images(kimage->type, kimage, &old_kimage); + if ( ret < 0 ) + return ret; + + kexec_unload_image(old_kimage); + + return 0; +} + +static uint16_t kexec_load_v1_arch(void) +{ +#ifdef CONFIG_X86 + return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64; +#else + return EM_NONE; +#endif +} + +static int kexec_segments_add_segment( + unsigned *nr_segments, xen_kexec_segment_t *segments, + unsigned long mfn) +{ + paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT; + int n = *nr_segments; - crash_save_vmcoreinfo(); + /* Need a new segment? */ + if ( n == 0 + || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr ) + { + n++; + if ( n > KEXEC_SEGMENT_MAX ) + return -EINVAL; + *nr_segments = n; + + set_xen_guest_handle(segments[n-1].buf, NULL); + segments[n-1].buf_size = 0; + segments[n-1].dest_maddr = maddr; + segments[n-1].dest_size = 0; } - /* Unload the old image if present and load successful */ - if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) ) + return 0; +} + +static int kexec_segments_from_ind_page(unsigned long mfn, + unsigned *nr_segments, + xen_kexec_segment_t *segments, + bool_t compat) +{ + void *page; + kimage_entry_t *entry; + int ret = 0; + + page = map_domain_page(mfn); + + /* + * Walk the indirection page list, adding destination pages to the + * segments. 
+ */ + for ( entry = page; ; ) { - if ( test_and_clear_bit((base + pos), &kexec_flags) ) + unsigned long ind; + + ind = kimage_entry_ind(entry, compat); + mfn = kimage_entry_mfn(entry, compat); + + switch ( ind ) { - image = &kexec_image[base + pos]; - machine_kexec_unload(load->type, base + pos, image); + case IND_DESTINATION: + ret = kexec_segments_add_segment(nr_segments, segments, mfn); + if ( ret < 0 ) + goto done; + break; + case IND_INDIRECTION: + unmap_domain_page(page); + page = map_domain_page(mfn); + if ( page == NULL ) + return -ENOMEM; + entry = page; + continue; + case IND_DONE: + goto done; + case IND_SOURCE: + segments[*nr_segments-1].dest_size += PAGE_SIZE; + break; + default: + ret = -EINVAL; + goto done; } + entry = kimage_entry_next(entry, compat); } +done: + unmap_domain_page(page); + return ret; +} + +static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat) +{ + struct kexec_image *kimage = NULL; + xen_kexec_segment_t *segments; + uint16_t arch; + unsigned nr_segments = 0; + unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT; + int ret; + + arch = kexec_load_v1_arch(); + if ( arch == EM_NONE ) + return -ENOSYS; + + segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX); + if ( segments == NULL ) + return -ENOMEM; + + /* + * Work out the image segments (destination only) from the + * indirection pages. + * + * This is needed so we don''t allocate pages that will overlap + * with the destination when building the new set of indirection + * pages below. + */ + ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat); + if ( ret < 0 ) + goto error; + + ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address, + nr_segments, segments); + if ( ret < 0 ) + goto error; + + /* + * Build a new set of indirection pages in the native format. + * + * This walks the guest provided indirection pages a second time. + * The guest could have altered then, invalidating the segment + * information constructed above. This will only result in the + * resulting image being potentially unrelocatable. 
+ */ + ret = kimage_build_ind(kimage, ind_mfn, compat); + if ( ret < 0 ) + goto error; + + ret = kexec_load_slot(kimage); + if ( ret < 0 ) + goto error; + return 0; + +error: + if ( !kimage ) + xfree(segments); + kimage_free(kimage); return ret; } -static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg) +static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg) { xen_kexec_load_v1_t load; if ( unlikely(copy_from_guest(&load, uarg, 1)) ) return -EFAULT; - return kexec_load_unload_internal(op, &load); + return kexec_do_load_v1(&load, 0); } -static int kexec_load_unload_compat(unsigned long op, - XEN_GUEST_HANDLE_PARAM(void) uarg) +static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg) { #ifdef CONFIG_COMPAT compat_kexec_load_v1_t compat_load; @@ -809,49 +1008,113 @@ static int kexec_load_unload_compat(unsigned long op, load.type = compat_load.type; XLAT_kexec_image(&load.image, &compat_load.image); - return kexec_load_unload_internal(op, &load); -#else /* CONFIG_COMPAT */ + return kexec_do_load_v1(&load, 1); +#else return 0; -#endif /* CONFIG_COMPAT */ +#endif } -static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg) +static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg) { - xen_kexec_exec_t exec; - xen_kexec_image_t *image; - int base, bit, pos, ret = -EINVAL; + xen_kexec_load_t load; + xen_kexec_segment_t *segments; + struct kexec_image *kimage = NULL; + int ret; - if ( unlikely(copy_from_guest(&exec, uarg, 1)) ) + if ( copy_from_guest(&load, uarg, 1) ) return -EFAULT; - if ( kexec_load_get_bits(exec.type, &base, &bit) ) + if ( load.nr_segments >= KEXEC_SEGMENT_MAX ) return -EINVAL; - pos = (test_bit(bit, &kexec_flags) != 0); - - /* Only allow kexec/kdump into loaded images */ - if ( !test_bit(base + pos, &kexec_flags) ) - return -ENOENT; + segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments); + if ( segments == NULL ) + return -ENOMEM; - switch (exec.type) + if ( copy_from_guest(segments, load.segments, load.nr_segments) ) { - case KEXEC_TYPE_DEFAULT: - image = &kexec_image[base + pos]; - ret = continue_hypercall_on_cpu(0, kexec_reboot, image); - break; - case KEXEC_TYPE_CRASH: - kexec_crash(); /* Does not return */ - break; + ret = -EFAULT; + goto error; } - return -EINVAL; /* never reached */ + ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr, + load.nr_segments, segments); + if ( ret < 0 ) + goto error; + + ret = kimage_load_segments(kimage); + if ( ret < 0 ) + goto error; + + ret = kexec_load_slot(kimage); + if ( ret < 0 ) + goto error; + + return 0; + +error: + if ( ! 
kimage ) + xfree(segments); + kimage_free(kimage); + return ret; +} + +static int kexec_do_unload(xen_kexec_unload_t *unload) +{ + struct kexec_image *old_kimage; + int ret; + + ret = kexec_swap_images(unload->type, NULL, &old_kimage); + if ( ret < 0 ) + return ret; + + kexec_unload_image(old_kimage); + + return 0; +} + +static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg) +{ + xen_kexec_load_v1_t load; + xen_kexec_unload_t unload; + + if ( copy_from_guest(&load, uarg, 1) ) + return -EFAULT; + + unload.type = load.type; + return kexec_do_unload(&unload); +} + +static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg) +{ +#ifdef CONFIG_COMPAT + compat_kexec_load_v1_t compat_load; + xen_kexec_unload_t unload; + + if ( copy_from_guest(&compat_load, uarg, 1) ) + return -EFAULT; + + unload.type = compat_load.type; + return kexec_do_unload(&unload); +#else + return 0; +#endif +} + +static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg) +{ + xen_kexec_unload_t unload; + + if ( unlikely(copy_from_guest(&unload, uarg, 1)) ) + return -EFAULT; + + return kexec_do_unload(&unload); } static int do_kexec_op_internal(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg, bool_t compat) { - unsigned long flags; int ret = -EINVAL; ret = xsm_kexec(XSM_PRIV); @@ -867,20 +1130,26 @@ static int do_kexec_op_internal(unsigned long op, ret = kexec_get_range(uarg); break; case KEXEC_CMD_kexec_load_v1: + if ( compat ) + ret = kexec_load_v1_compat(uarg); + else + ret = kexec_load_v1(uarg); + break; case KEXEC_CMD_kexec_unload_v1: - spin_lock_irqsave(&kexec_lock, flags); - if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags)) - { - if (compat) - ret = kexec_load_unload_compat(op, uarg); - else - ret = kexec_load_unload(op, uarg); - } - spin_unlock_irqrestore(&kexec_lock, flags); + if ( compat ) + ret = kexec_unload_v1_compat(uarg); + else + ret = kexec_unload_v1(uarg); break; case KEXEC_CMD_kexec: ret = kexec_exec(uarg); break; + case KEXEC_CMD_kexec_load: + ret = kexec_load(uarg); + break; + case KEXEC_CMD_kexec_unload: + ret = kexec_unload(uarg); + break; } return ret; diff --git a/xen/common/kimage.c b/xen/common/kimage.c index 995ce36..1cc0ef7 100644 --- a/xen/common/kimage.c +++ b/xen/common/kimage.c @@ -812,6 +812,109 @@ int kimage_load_segments(struct kexec_image *image) return 0; } +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat) +{ + if ( compat ) + return (kimage_entry_t *)((uint32_t *)entry + 1); + return entry + 1; +} + +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat) +{ + if ( compat ) + return *(uint32_t *)entry >> PAGE_SHIFT; + return *entry >> PAGE_SHIFT; +} + +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat) +{ + if ( compat ) + return *(uint32_t *)entry & 0xf; + return *entry & 0xf; +} + +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn, + bool_t compat) +{ + void *page; + kimage_entry_t *entry; + int ret = 0; + paddr_t dest = KIMAGE_NO_DEST; + + page = map_domain_page(ind_mfn); + if ( page == NULL ) + return -ENOMEM; + + /* + * Walk the guest-supplied indirection pages, adding entries to + * the image''s indirection pages. 
+ */ + for ( entry = page; ; ) + { + unsigned long ind; + unsigned long mfn; + + ind = kimage_entry_ind(entry, compat); + mfn = kimage_entry_mfn(entry, compat); + + switch ( ind ) + { + case IND_DESTINATION: + dest = (paddr_t)mfn << PAGE_SHIFT; + ret = kimage_set_destination(image, dest); + if ( ret < 0 ) + goto done; + break; + case IND_INDIRECTION: + unmap_domain_page(page); + page = map_domain_page(mfn); + entry = page; + continue; + case IND_DONE: + kimage_terminate(image); + goto done; + case IND_SOURCE: + { + struct page_info *guest_page, *xen_page; + + guest_page = mfn_to_page(mfn); + if ( !get_page(guest_page, current->domain) ) + { + ret = -EFAULT; + goto done; + } + + xen_page = kimage_alloc_page(image, dest); + if ( !xen_page ) + { + put_page(guest_page); + ret = -ENOMEM; + goto done; + } + + copy_domain_page(page_to_mfn(xen_page), mfn); + put_page(guest_page); + + ret = kimage_add_page(image, page_to_maddr(xen_page)); + if ( ret < 0 ) + goto done; + dest += PAGE_SIZE; + break; + } + default: + ret = -EINVAL; + goto done; + } + entry = kimage_entry_next(entry, compat); + } +done: + unmap_domain_page(page); + return ret; +} + + + + /* * Local variables: * mode: C diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h index 2eefcf4..1695228 100644 --- a/xen/include/asm-x86/fixmap.h +++ b/xen/include/asm-x86/fixmap.h @@ -57,9 +57,6 @@ enum fixed_addresses { FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1, FIX_HPET_BASE, FIX_CYCLONE_TIMER, - FIX_KEXEC_BASE_0, - FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \ - + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1, FIX_IOMMU_REGS_BASE_0, FIX_IOMMU_REGS_END = FIX_IOMMU_REGS_BASE_0 + MAX_IOMMUS-1, FIX_IOMMU_MMIO_BASE_0, diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h new file mode 100644 index 0000000..9f7c29e --- /dev/null +++ b/xen/include/asm-x86/machine_kexec.h @@ -0,0 +1,16 @@ +#ifndef __X86_MACHINE_KEXEC_H__ +#define __X86_MACHINE_KEXEC_H__ + +#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */ + +#ifndef __ASSEMBLY__ + +extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt, + unsigned long ind_maddr, unsigned long entry_maddr, + unsigned long flags); + +extern unsigned long kexec_reloc_size; + +#endif + +#endif /* __X86_MACHINE_KEXEC_H__ */ diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h index 1a5dda1..7bb9213 100644 --- a/xen/include/xen/kexec.h +++ b/xen/include/xen/kexec.h @@ -6,6 +6,7 @@ #include <public/kexec.h> #include <asm/percpu.h> #include <xen/elfcore.h> +#include <xen/kimage.h> typedef struct xen_kexec_reserve { unsigned long size; @@ -40,11 +41,11 @@ extern enum low_crashinfo low_crashinfo_mode; extern paddr_t crashinfo_maxaddr_bits; void kexec_early_calculations(void); -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image); -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image); +int machine_kexec_load(struct kexec_image *image); +void machine_kexec_unload(struct kexec_image *image); void machine_kexec_reserved(xen_kexec_reserve_t *reservation); -void machine_reboot_kexec(xen_kexec_image_t *image); -void machine_kexec(xen_kexec_image_t *image); +void machine_reboot_kexec(struct kexec_image *image); +void machine_kexec(struct kexec_image *image); void kexec_crash(void); void kexec_crash_save_cpu(void); crash_xen_info_t *kexec_crash_save_info(void); @@ -52,11 +53,6 @@ void machine_crash_shutdown(void); int machine_kexec_get(xen_kexec_range_t *range); int machine_kexec_get_xen(xen_kexec_range_t 
*range); -void compat_machine_kexec(unsigned long rnk, - unsigned long indirection_page, - unsigned long *page_list, - unsigned long start_address); - /* vmcoreinfo stuff */ #define VMCOREINFO_BYTES (4096) #define VMCOREINFO_NOTE_NAME "VMCOREINFO_XEN" diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h index 9555688..fb62c6f 100644 --- a/xen/include/xen/kimage.h +++ b/xen/include/xen/kimage.h @@ -46,6 +46,12 @@ int kimage_load_segments(struct kexec_image *image); struct page_info *kimage_alloc_control_page(struct kexec_image *image, unsigned memflags); +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat); +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat); +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat); +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn, + bool_t compat); + #endif /* __XEN_KIMAGE_H__ */ /* -- 1.7.2.5
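For reference, the indirection-page entries walked by kexec_segments_from_ind_page() and kimage_build_ind() above, and by the relocate_pages loop in kexec_reloc.S, encode an entry type in their low bits and a page-aligned machine address in the remaining bits. The flag values below match the bits tested in the assembly; the authoritative definitions live in the series' kimage header, so treat this block (and the helper, whose name is invented) purely as an illustration:

/* Illustration of the indirection entry encoding used above. */
#define IND_DESTINATION  0x1   /* set the current destination maddr */
#define IND_INDIRECTION  0x2   /* continue at another indirection page */
#define IND_DONE         0x4   /* end of the entry list */
#define IND_SOURCE       0x8   /* copy this source page to the destination */
#define IND_ZERO        0x10   /* zero-fill the next destination page */

/* Hypothetical helper showing how a single entry is composed. */
static inline kimage_entry_t example_make_entry(paddr_t maddr, unsigned long flag)
{
    return (maddr & PAGE_MASK) | flag;
}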
From: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- xen/common/kexec.c | 2 ++ xen/common/shutdown.c | 3 +++ 2 files changed, 5 insertions(+), 0 deletions(-) diff --git a/xen/common/kexec.c b/xen/common/kexec.c index b180e0c..75118cf 100644 --- a/xen/common/kexec.c +++ b/xen/common/kexec.c @@ -307,6 +307,8 @@ void kexec_crash(void) if ( !test_bit(KEXEC_IMAGE_CRASH_BASE + pos, &kexec_flags) ) return; + printk("Executing crash image\n"); + kexecing = TRUE; kexec_common_shutdown(); diff --git a/xen/common/shutdown.c b/xen/common/shutdown.c index 73a7d7b..b676a03 100644 --- a/xen/common/shutdown.c +++ b/xen/common/shutdown.c @@ -47,6 +47,9 @@ void dom0_shutdown(u8 reason) { debugger_trap_immediate(); printk("Domain 0 crashed: "); +#ifdef CONFIG_KEXEC + kexec_crash(); +#endif maybe_reboot(); break; /* not reached */ } -- 1.7.2.5
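For context, a sketch of the guest side this hunk enables: dom0 only needs to perform an ordinary crash shutdown and Xen itself enters the loaded crash image, so no explicit KEXEC_CMD_kexec call is required from the kernel. Linux-style pvops names are assumed here purely for illustration; the function name is invented.

#include <xen/interface/sched.h>

/* Sketch: dom0 signals a crash; with the change above, Xen's
 * dom0_shutdown() then runs the loaded crash image via kexec_crash(). */
static void example_dom0_crash(void)
{
    struct sched_shutdown arg = { .reason = SHUTDOWN_crash };

    HYPERVISOR_sched_op(SCHEDOP_shutdown, &arg);
}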
From: David Vrabel <david.vrabel@citrix.com> Hypercall buffer arrays are used when a hypercall takes a variable length array of buffers. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- tools/libxc/xc_hcall_buf.c | 73 ++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 27 ++++++++++++++++ 2 files changed, 100 insertions(+), 0 deletions(-) diff --git a/tools/libxc/xc_hcall_buf.c b/tools/libxc/xc_hcall_buf.c index c354677..e762a93 100644 --- a/tools/libxc/xc_hcall_buf.c +++ b/tools/libxc/xc_hcall_buf.c @@ -228,6 +228,79 @@ void xc__hypercall_bounce_post(xc_interface *xch, xc_hypercall_buffer_t *b) xc__hypercall_buffer_free(xch, b); } +struct xc_hypercall_buffer_array { + unsigned max_bufs; + xc_hypercall_buffer_t *bufs; +}; + +xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch, + unsigned n) +{ + xc_hypercall_buffer_array_t *array; + xc_hypercall_buffer_t *bufs = NULL; + + array = malloc(sizeof(*array)); + if ( array == NULL ) + goto error; + + bufs = calloc(n, sizeof(*bufs)); + if ( bufs == NULL ) + goto error; + + array->max_bufs = n; + array->bufs = bufs; + + return array; + +error: + free(bufs); + free(array); + return NULL; +} + +void *xc__hypercall_buffer_array_alloc(xc_interface *xch, + xc_hypercall_buffer_array_t *array, + unsigned index, + xc_hypercall_buffer_t *hbuf, + size_t size) +{ + void *buf; + + if ( index >= array->max_bufs || array->bufs[index].hbuf ) + abort(); + + buf = xc__hypercall_buffer_alloc(xch, hbuf, size); + if ( buf ) + array->bufs[index] = *hbuf; + return buf; +} + +void *xc__hypercall_buffer_array_get(xc_interface *xch, + xc_hypercall_buffer_array_t *array, + unsigned index, + xc_hypercall_buffer_t *hbuf) +{ + if ( index >= array->max_bufs || array->bufs[index].hbuf == NULL ) + abort(); + + *hbuf = array->bufs[index]; + return array->bufs[index].hbuf; +} + +void xc_hypercall_buffer_array_destroy(xc_interface *xc, + xc_hypercall_buffer_array_t *array) +{ + unsigned i; + + if ( array == NULL ) + return; + + for (i = 0; i < array->max_bufs; i++ ) + xc__hypercall_buffer_free(xc, &array->bufs[i]); + free(array->bufs); + free(array); +} + /* * Local variables: * mode: C diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 5697765..f74f5de 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -321,6 +321,33 @@ void xc__hypercall_buffer_free_pages(xc_interface *xch, xc_hypercall_buffer_t *b #define xc_hypercall_buffer_free_pages(_xch, _name, _nr) xc__hypercall_buffer_free_pages(_xch, HYPERCALL_BUFFER(_name), _nr) /* + * Array of hypercall buffers. + * + * Create an array with xc_hypercall_buffer_array_create() and + * populate it by declaring one hypercall buffer in a loop and + * allocating the buffer with xc_hypercall_buffer_array_alloc(). + * + * To access a previously allocated buffers, declare a new hypercall + * buffer and call xc_hypercall_buffer_array_get(). + * + * Destroy the array with xc_hypercall_buffer_array_destroy() to free + * the array and all its alocated hypercall buffers. 
+ */ +struct xc_hypercall_buffer_array; +typedef struct xc_hypercall_buffer_array xc_hypercall_buffer_array_t; + +xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch, unsigned n); +void *xc__hypercall_buffer_array_alloc(xc_interface *xch, xc_hypercall_buffer_array_t *array, + unsigned index, xc_hypercall_buffer_t *hbuf, size_t size); +#define xc_hypercall_buffer_array_alloc(_xch, _array, _index, _name, _size) \ + xc__hypercall_buffer_array_alloc(_xch, _array, _index, HYPERCALL_BUFFER(_name), _size) +void *xc__hypercall_buffer_array_get(xc_interface *xch, xc_hypercall_buffer_array_t *array, + unsigned index, xc_hypercall_buffer_t *hbuf); +#define xc_hypercall_buffer_array_get(_xch, _array, _index, _name, _size) \ + xc__hypercall_buffer_array_get(_xch, _array, _index, HYPERCALL_BUFFER(_name)) +void xc_hypercall_buffer_array_destroy(xc_interface *xc, xc_hypercall_buffer_array_t *array); + +/* * CPUMAP handling */ typedef uint8_t *xc_cpumap_t; -- 1.7.2.5
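To illustrate the calling pattern described in the comment above, a rough sketch follows. The function name, item count and the data-filling step are made up, and error handling is trimmed; it assumes the existing DECLARE_HYPERCALL_BUFFER() helper from xenctrl.h.

#include <string.h>
#include <xenctrl.h>

/* Sketch: bounce several per-item buffers for a single hypercall. */
static int example_bounce_items(xc_interface *xch, void *data[], size_t len[],
                                unsigned nr_items)
{
    xc_hypercall_buffer_array_t *array;
    unsigned i;
    int ret = -1;

    array = xc_hypercall_buffer_array_create(xch, nr_items);
    if ( array == NULL )
        return -1;

    for ( i = 0; i < nr_items; i++ )
    {
        DECLARE_HYPERCALL_BUFFER(uint8_t, buf);

        buf = xc_hypercall_buffer_array_alloc(xch, array, i, buf, len[i]);
        if ( buf == NULL )
            goto out;

        memcpy(buf, data[i], len[i]);
        /* ...store a handle to 'buf' in the hypercall argument here... */
    }

    /* ...issue the hypercall... */
    ret = 0;

out:
    xc_hypercall_buffer_array_destroy(xch, array);
    return ret;
}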
From: David Vrabel <david.vrabel@citrix.com> Add xc_kexec_exec(), xc_kexec_get_ranges(), xc_kexec_load(), and xc_kexec_unload(). The load and unload calls require the v2 load and unload ops. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> --- tools/libxc/Makefile | 1 + tools/libxc/xc_kexec.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 55 +++++++++++++++++++ 3 files changed, 196 insertions(+), 0 deletions(-) create mode 100644 tools/libxc/xc_kexec.c diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index 512a994..528198e 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -31,6 +31,7 @@ CTRL_SRCS-y += xc_mem_access.c CTRL_SRCS-y += xc_memshr.c CTRL_SRCS-y += xc_hcall_buf.c CTRL_SRCS-y += xc_foreign_memory.c +CTRL_SRCS-y += xc_kexec.c CTRL_SRCS-y += xtl_core.c CTRL_SRCS-y += xtl_logger_stdio.c CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c diff --git a/tools/libxc/xc_kexec.c b/tools/libxc/xc_kexec.c new file mode 100644 index 0000000..81c0338 --- /dev/null +++ b/tools/libxc/xc_kexec.c @@ -0,0 +1,140 @@ +/****************************************************************************** + * xc_kexec.c + * + * API for loading and executing kexec images. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; + * version 2.1 of the License. + * + * Copyright (C) 2013 Citrix Systems R&D Ltd. + */ +#include "xc_private.h" + +int xc_kexec_exec(xc_interface *xch, int type) +{ + DECLARE_HYPERCALL; + DECLARE_HYPERCALL_BUFFER(xen_kexec_exec_t, exec); + int ret = -1; + + exec = xc_hypercall_buffer_alloc(xch, exec, sizeof(*exec)); + if ( exec == NULL ) + { + PERROR("Count not alloc bounce buffer for kexec_exec hypercall"); + goto out; + } + + exec->type = type; + + hypercall.op = __HYPERVISOR_kexec_op; + hypercall.arg[0] = KEXEC_CMD_kexec; + hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(exec); + + ret = do_xen_hypercall(xch, &hypercall); + +out: + xc_hypercall_buffer_free(xch, exec); + + return ret; +} + +int xc_kexec_get_range(xc_interface *xch, int range, int nr, + uint64_t *size, uint64_t *start) +{ + DECLARE_HYPERCALL; + DECLARE_HYPERCALL_BUFFER(xen_kexec_range_t, get_range); + int ret = -1; + + get_range = xc_hypercall_buffer_alloc(xch, get_range, sizeof(*get_range)); + if ( get_range == NULL ) + { + PERROR("Could not alloc bounce buffer for kexec_get_range hypercall"); + goto out; + } + + get_range->range = range; + get_range->nr = nr; + + hypercall.op = __HYPERVISOR_kexec_op; + hypercall.arg[0] = KEXEC_CMD_kexec_get_range; + hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(get_range); + + ret = do_xen_hypercall(xch, &hypercall); + + *size = get_range->size; + *start = get_range->start; + +out: + xc_hypercall_buffer_free(xch, get_range); + + return ret; +} + +int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch, + uint64_t entry_maddr, + uint32_t nr_segments, xen_kexec_segment_t *segments) +{ + int ret = -1; + DECLARE_HYPERCALL; + DECLARE_HYPERCALL_BOUNCE(segments, sizeof(*segments) * nr_segments, + XC_HYPERCALL_BUFFER_BOUNCE_IN); + DECLARE_HYPERCALL_BUFFER(xen_kexec_load_t, load); + + if ( xc_hypercall_bounce_pre(xch, segments) ) + { + PERROR("Could not allocate bounce buffer for kexec load hypercall"); + goto out; + } + load = xc_hypercall_buffer_alloc(xch, load, 
sizeof(*load)); + if ( load == NULL ) + { + PERROR("Could not allocate buffer for kexec load hypercall"); + goto out; + } + + load->type = type; + load->arch = arch; + load->entry_maddr = entry_maddr; + load->nr_segments = nr_segments; + set_xen_guest_handle(load->segments, segments); + + hypercall.op = __HYPERVISOR_kexec_op; + hypercall.arg[0] = KEXEC_CMD_kexec_load; + hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(load); + + ret = do_xen_hypercall(xch, &hypercall); + +out: + xc_hypercall_buffer_free(xch, load); + xc_hypercall_bounce_post(xch, segments); + + return ret; +} + +int xc_kexec_unload(xc_interface *xch, int type) +{ + DECLARE_HYPERCALL; + DECLARE_HYPERCALL_BUFFER(xen_kexec_unload_t, unload); + int ret = -1; + + unload = xc_hypercall_buffer_alloc(xch, unload, sizeof(*unload)); + if ( unload == NULL ) + { + PERROR("Count not alloc buffer for kexec unload hypercall"); + goto out; + } + + unload->type = type; + + hypercall.op = __HYPERVISOR_kexec_op; + hypercall.arg[0] = KEXEC_CMD_kexec_unload; + hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(unload); + + ret = do_xen_hypercall(xch, &hypercall); + +out: + xc_hypercall_buffer_free(xch, unload); + + return ret; +} diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index f74f5de..6af82f7 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -46,6 +46,7 @@ #include <xen/hvm/params.h> #include <xen/xsm/flask_op.h> #include <xen/tmem.h> +#include <xen/kexec.h> #include "xentoollog.h" @@ -2316,4 +2317,58 @@ int xc_compression_uncompress_page(xc_interface *xch, char *compbuf, unsigned long compbuf_size, unsigned long *compbuf_pos, char *dest); +/* + * Execute an image previously loaded with xc_kexec_load(). + * + * Does not return on success. + * + * Fails with: + * ENOENT if the specified image has not been loaded. + */ +int xc_kexec_exec(xc_interface *xch, int type); + +/* + * Find the machine address and size of certain memory areas. + * + * KEXEC_RANGE_MA_CRASH crash area + * KEXEC_RANGE_MA_XEN Xen itself + * KEXEC_RANGE_MA_CPU CPU note for CPU number ''nr'' + * KEXEC_RANGE_MA_XENHEAP xenheap + * KEXEC_RANGE_MA_EFI_MEMMAP EFI Memory Map + * KEXEC_RANGE_MA_VMCOREINFO vmcoreinfo + * + * Fails with: + * EINVAL if the range or CPU number isn''t valid. + */ +int xc_kexec_get_range(xc_interface *xch, int range, int nr, + uint64_t *size, uint64_t *start); + +/* + * Load a kexec image into memory. + * + * The image may be of type KEXEC_TYPE_DEFAULT (executed on request) + * or KEXEC_TYPE_CRASH (executed on a crash). + * + * The image architecture may be a 32-bit variant of the hypervisor + * architecture (e.g, EM_386 on a x86-64 hypervisor). + * + * Fails with: + * ENOMEM if there is insufficient memory for the new image. + * EINVAL if the image does not fit into the crash area or the entry + * point isn''t within one of segments. + * EBUSY if another image is being executed. + */ +int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch, + uint64_t entry_maddr, + uint32_t nr_segments, xen_kexec_segment_t *segments); + +/* + * Unload a kexec image. + * + * This prevents a KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH image from + * being executed. The crash images are not cleared from the crash + * region. + */ +int xc_kexec_unload(xc_interface *xch, int type); + #endif /* XENCTRL_H */ -- 1.7.2.5
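A rough sketch of how a caller might combine the new calls to stage a single-segment crash image. Names such as blob, blob_size and entry_offset are placeholders; a real loader (e.g. kexec-tools) builds one segment per ELF PT_LOAD and computes the real entry point, and a 32-bit image would pass EM_386 instead of EM_X86_64 (taken here from <elf.h>).

#include <elf.h>
#include <xenctrl.h>

static int example_stage_crash_image(xc_interface *xch, const void *blob,
                                     uint64_t blob_size, uint64_t entry_offset)
{
    uint64_t crash_base, crash_size, dest_size;
    xen_kexec_segment_t seg;

    /* Where is the memory reserved for crash images? */
    if ( xc_kexec_get_range(xch, KEXEC_RANGE_MA_CRASH, 0,
                            &crash_size, &crash_base) < 0 )
        return -1;

    dest_size = (blob_size + XC_PAGE_SIZE - 1) & ~(uint64_t)(XC_PAGE_SIZE - 1);
    if ( dest_size > crash_size )
        return -1;

    /* One segment covering the whole image at the start of the crash area. */
    set_xen_guest_handle(seg.buf, blob);
    seg.buf_size = blob_size;
    seg.dest_maddr = crash_base;
    seg.dest_size = dest_size;

    /* Once loaded, Xen executes the image itself when dom0 crashes. */
    return xc_kexec_load(xch, KEXEC_TYPE_CRASH, EM_X86_64,
                         crash_base + entry_offset, 1, &seg);
}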
David Vrabel
2013-Jun-24 17:42 UTC
[PATCH 09/10] x86: check kexec relocation code fits in a page
From: David Vrabel <david.vrabel@citrix.com> The kexec relocation (control) code must fit in a single page so add a link time check for this. Signed-off-by: David Vrabel <david.vrabel@citrix.com> --- xen/arch/x86/xen.lds.S | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S index d959941..eebed01 100644 --- a/xen/arch/x86/xen.lds.S +++ b/xen/arch/x86/xen.lds.S @@ -186,3 +186,7 @@ SECTIONS .stab.indexstr 0 : { *(.stab.indexstr) } .comment 0 : { *(.comment) } } + +#ifdef CONFIG_KEXEC +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large") +#endif -- 1.7.2.5
From: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> --- MAINTAINERS | 8 ++++++++ 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 843f9e3..0082792 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -190,6 +190,14 @@ X: xen/drivers/passthrough/amd/ X: xen/drivers/passthrough/vtd/ F: xen/include/xen/iommu.h +KEXEC +M: David Vrabel <david.vrabel@citrix.com> +S: Supported +F: xen/common/{kexec,kimage}.c +F: xen/include/{kexec,kimage}.h +F: xen/arch/x86/machine_kexec.c +F: xen/arch/x86/x86_64/kexec_reloc.S + LINUX (PV_OPS) M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> S: Supported -- 1.7.2.5
Andrew Cooper
2013-Jun-24 20:31 UTC
Re: [PATCHv6 0/10] kexec: extend kexec hypercall for use with pv-ops kernels
On 24/06/2013 18:42, David Vrabel wrote:> The series (for Xen 4.4) improves the kexec hypercall by making Xen > responsible for loading and relocating the image. This allows kexec > to be usable by pv-ops kernels and should allow kexec to be usable > from a HVM or PVH privileged domain. > > The first patch is a simple clean-up. > > The second patch allows hypercall structures to be ABI compatible > between 32- and 64-bit guests (by reusing stuff present for domctls > and sysctls). This seems better than having to keep adding compat > handling for new hypercalls etc. > > Patch 3 introduces the new ABI. > > Patch 4 and 5 nearly completely reimplement the kexec load, unload and > exec sub-ops. The old load_v1 sub-op is then implemented on top of > the new code. > > Patch 6 calls the kexec image when dom0 crashes. This avoids having > to alter dom0 kernels to do a exec sub-op call on crash -- a > SHUTDOWN_crash by dom0 will trigger the kexec. > > Patches 7 and 8 add the libxc API for the kexec calls. These have > been acked-by Ian Campbell already. > > Patch 9 adds a link time check for the size of the relocate code. > > Patch 10 adds myself as the maintainer for kexec in Xen. > > The required patch series for kexec-tools have previously been posted > and this series has been rebased on the latest kexec-tools and is > available from the xen-v3 branch of: > > http://xenbits.xen.org/gitweb/?p=people/dvrabel/kexec-tools.git;a=summaryAs you have picked up my one bugfix from the patch queue, please consider the entire series: Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> ~Andrew> > Changes since v5: > > - Fix double free in KEXEC_load_v1 failure path. > - Only copy the relocation code and not the whole page. > - Add myself as the kexec maintainer. > > Changes since v4 (v5 was not posted to the list): > > - _rsvd -> _pad in one of the public ABI structures. > - Fix bug where trailing pages were not zeroed. This fixes loading a > 64-bit Linux kernel using a more recent version of kexec-tools. > - Check the relocation code fits into a page at link time. > > Changes since v3: > > - Use paddr_t and page_to_maddr() etc. for portability. > - Add explicit padding to hypercall structures where required. > - Minor cleanup of the kexec_reloc assembly. > - Print a message before exec''ing a crash image. > - Style fixes (tabs, trailing whitespace) and typos. > - Fix a bug where using the V1 interface and unloading a image may crash. > > Changes since v2: > > - Provide old struct xen_kexec_load if __XEN_INTERFACE_VERSION__ < 4.3 > - Adjust new struct xen_kexec_load to avoid unnecessary padding. > - Use domheap pages for the image and control pages. > - Remove the DBG() macros from the reloc code. > > David > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Jan Beulich
2013-Jun-25 07:42 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: > --- a/xen/include/public/arch-x86/xen-x86_32.h > +++ b/xen/include/public/arch-x86/xen-x86_32.h > @@ -91,8 +91,7 @@ > #define machine_to_phys_mapping ((unsigned long *)MACH2PHYS_VIRT_START) > #endif > > -/* 32-/64-bit invariability for control interfaces (domctl/sysctl). */ > -#if defined(__XEN__) || defined(__XEN_TOOLS__) > +/* 32-/64-bit invariability. */ > #undef ___DEFINE_XEN_GUEST_HANDLE > #define ___DEFINE_XEN_GUEST_HANDLE(name, type) \ > typedef struct { type *p; } \ > @@ -107,7 +106,6 @@ > #define uint64_aligned_t uint64_t __attribute__((aligned(8)))This line is the reason why such a change is not acceptable: We require the headers to not use gcc extensions outside of regions guarded by dependencies on __XEN__ and/or __XEN_TOOLS__ (which we know/require will always be built by gcc compatible tool chains).> #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name > #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name) > -#endif > > #ifndef __ASSEMBLY__I''m afraid you''ll need to find a way to do what you want in the kexec interface with the traditional manual padding approach. Jan
Jan Beulich
2013-Jun-25 07:45 UTC
Re: [PATCH 03/10] kexec: add public interface for improved load/unload sub-ops
>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: > @@ -152,6 +152,64 @@ typedef struct xen_kexec_range { > unsigned long start; > } xen_kexec_range_t; > > +#if __XEN_INTERFACE_VERSION__ >= 0x00040300... >= 0x00040400 Jan> +/* > + * A contiguous chunk of a kexec image and it''s destination machine > + * address. > + */ > +typedef struct xen_kexec_segment { > + XEN_GUEST_HANDLE_64(const_void) buf; > + uint64_t buf_size; > + uint64_t dest_maddr; > + uint64_t dest_size; > +} xen_kexec_segment_t; > +DEFINE_XEN_GUEST_HANDLE(xen_kexec_segment_t); > + > +/* > + * Load a kexec image into memory. > + * > + * For KEXEC_TYPE_DEFAULT images, the segments may be anywhere in RAM. > + * The image is relocated prior to being executed. > + * > + * For KEXEC_TYPE_CRASH images, each segment of the image must reside > + * in the memory region reserved for kexec (KEXEC_RANGE_MA_CRASH) and > + * the entry point must be within the image. The caller is responsible > + * for ensuring that multiple images do not overlap. > + */ > + > +#define KEXEC_CMD_kexec_load 4 > +typedef struct xen_kexec_load { > + uint8_t type; /* One of KEXEC_TYPE_* */ > + uint8_t _pad; > + uint16_t arch; /* ELF machine type (EM_*). */ > + uint32_t nr_segments; > + XEN_GUEST_HANDLE_64(xen_kexec_segment_t) segments; > + uint64_t entry_maddr; /* image entry point machine address. */ > +} xen_kexec_load_t; > +DEFINE_XEN_GUEST_HANDLE(xen_kexec_load_t); > + > +/* > + * Unload a kexec image. > + * > + * Type must be one of KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH. > + */ > +#define KEXEC_CMD_kexec_unload 5 > +typedef struct xen_kexec_unload { > + uint8_t type; > +} xen_kexec_unload_t; > +DEFINE_XEN_GUEST_HANDLE(xen_kexec_unload_t); > + > +#else /* __XEN_INTERFACE_VERSION__ < 0x00040300 */ > + > +#define KEXEC_CMD_kexec_load KEXEC_CMD_kexec_load_v1 > +#define KEXEC_CMD_kexec_unload KEXEC_CMD_kexec_unload_v1 > +typedef struct xen_kexec_load { > + int type; > + xen_kexec_image_t image; > +} xen_kexec_load_t; > + > +#endif > + > #endif /* _XEN_PUBLIC_KEXEC_H */ > > /*
Jan Beulich
2013-Jun-25 07:54 UTC
Re: [PATCH 04/10] kexec: add infrastructure for handling kexec images
>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: > +static struct page_info *kimage_alloc_zeroed_page(unsigned memflags) > +{ > + struct page_info *page; > + > + page = alloc_domheap_page(NULL, memflags); > + if ( page == NULL )Please be consistent - either always use "== NULL", ...> + return NULL; > + > + clear_domain_page(page_to_mfn(page)); > + > + return page; > +} > + > +static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry, > + unsigned long nr_segments, > + xen_kexec_segment_t *segments, uint8_t type) > +{ > + struct kexec_image *image; > + unsigned long i; > + int result; > + > + /* Allocate a controlling structure */ > + result = -ENOMEM; > + image = xzalloc(typeof(*image)); > + if ( !image )... or (preferably afaic) always use "!".> + goto out; > + > + image->control_page = ~0; /* By default this does not apply */Use a recognizable #define instead?> + image->entry_maddr = entry; > + image->type = type; > + image->nr_segments = nr_segments; > + image->segments = segments; > + > + INIT_PAGE_LIST_HEAD(&image->control_pages); > + INIT_PAGE_LIST_HEAD(&image->dest_pages); > + INIT_PAGE_LIST_HEAD(&image->unusable_pages); > + > + /* > + * Verify we have good destination addresses. The caller is > + * responsible for making certain we don''t attempt to load > + * the new image into invalid or reserved areas of RAM. This > + * just verifies it is an address we can use. > + * > + * Since the kernel does everything in page size chunks ensure > + * the destination addresses are page aligned. Too many > + * special cases crop of when we don''t do this. The most > + * insidious is getting overlapping destination addresses > + * simply because addresses are changed to page size > + * granularity. > + */ > + result = -EADDRNOTAVAIL; > + for ( i = 0; i < nr_segments; i++ ) > + { > + paddr_t mstart, mend; > + > + mstart = image->segments[i].dest_maddr; > + mend = mstart + image->segments[i].dest_size; > + if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )Expressions like this can be abbreviated to if ( (mstart | mend) & ~PAGE_MASK )> + goto out; > + } > + > + /* Verify our destination addresses do not overlap.Coding style (the comment immediately above is done correctly). There are more of these further down... There are also a few cases of bogus blank lines - while the coding style document doesn''t say anything in that regard, excessive amounts of blank lines increase the risk of future patches getting applied incorrectly due to the patch context becoming less meaningful. Jan
Jan Beulich
2013-Jun-25 08:31 UTC
Re: [PATCH 05/10] kexec: extend hypercall with improved load/unload ops
>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: > +static void init_level2_page(l2_pgentry_t *l2, unsigned long addr) > +{ > + unsigned long end_addr; > + > + addr &= PAGE_MASK; > + end_addr = addr + L2_PAGETABLE_ENTRIES * (1ul << L2_PAGETABLE_SHIFT);For one, with the code below, this can''t be right: "addr" is getting only page aligned, but the rest of the code assumes it is suitable as a super-page. Further, you''re risking mapping non-memory here, as the caller doesn''t pass the true end of the memory block to be mapped. And then the expression is odd - why not just end_addr = addr + (L2_PAGETABLE_ENTRIES << L2_PAGETABLE_SHIFT);> +static int init_level4_page(struct kexec_image *image, l4_pgentry_t *l4, > + unsigned long addr, unsigned long last_addr) > +{ > + unsigned long end_addr; > + int result; > + > + addr &= PAGE_MASK; > + end_addr = addr + L4_PAGETABLE_ENTRIES * (1ul << L4_PAGETABLE_SHIFT); > + > + while ( (addr < last_addr) && (addr < end_addr) ) > + { > + struct page_info *l3_page; > + l3_pgentry_t *l3; > + > + l3_page = kimage_alloc_control_page(image, 0); > + if ( !l3_page ) > + return -ENOMEM; > + l3 = __map_domain_page(l3_page); > + result = init_level3_page(image, l3, addr, last_addr); > + unmap_domain_page(l3); > + if (result)Coding style.> +static int init_transition_pgtable(struct kexec_image *image, l4_pgentry_t *l4) > +{ > + struct page_info *l3_page; > + struct page_info *l2_page; > + struct page_info *l1_page; > + unsigned long vaddr, paddr; > + l3_pgentry_t *l3 = NULL; > + l2_pgentry_t *l2 = NULL; > + l1_pgentry_t *l1 = NULL; > + int ret = -ENOMEM; > + > + vaddr = (unsigned long)kexec_reloc; > + paddr = page_to_maddr(image->control_code_page); > + > + l4 += l4_table_offset(vaddr); > + if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) ) > + { > + l3_page = kimage_alloc_control_page(image, 0); > + if ( !l3_page ) > + goto out; > + l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR)); > + } > + else > + l3_page = l4e_get_page(*l4); > + > + l3 = __map_domain_page(l3_page); > + l3 += l3_table_offset(vaddr); > + if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) ) > + { > + l2_page = kimage_alloc_control_page(image, 0); > + if ( !l2_page ) > + goto out; > + l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR)); > + } > + else > + l2_page = l3e_get_page(*l3);Afaict you''re done using "l3" here, so you should unmap it in order to reduce the pressure on the domain page mapping resources.> + > + l2 = __map_domain_page(l2_page); > + l2 += l2_table_offset(vaddr); > + if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) ) > + { > + l1_page = kimage_alloc_control_page(image, 0); > + if ( !l1_page ) > + goto out; > + l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR)); > + } > + else > + l1_page = l2e_get_page(*l2);Same for "l2" at this point.> +static int build_reloc_page_table(struct kexec_image *image) > +{ > + struct page_info *l4_page; > + l4_pgentry_t *l4; > + int result; > + > + l4_page = kimage_alloc_control_page(image, 0); > + if ( !l4_page ) > + return -ENOMEM; > + l4 = __map_domain_page(l4_page); > + > + result = init_level4_page(image, l4, 0, max_page << PAGE_SHIFT);What about holes in the physical address space - not just the MMIO hole below 4Gb is a problem here, but also discontiguous physical memory.> --- /dev/null > +++ b/xen/arch/x86/x86_64/kexec_reloc.S > @@ -0,0 +1,211 @@ > +/* > + * Relocate a kexec_image to its destination and call it. > + * > + * Copyright (C) 2013 Citrix Systems R&D Ltd. 
> + * > + * Portions derived from Linux''s arch/x86/kernel/relocate_kernel_64.S. > + * > + * Copyright (C) 2002-2005 Eric Biederman <ebiederm@xmission.com> > + * > + * This source code is licensed under the GNU General Public License, > + * Version 2. See the file COPYING for more details. > + */ > +#include <xen/config.h> > + > +#include <asm/asm_defns.h> > +#include <asm/msr.h> > +#include <asm/page.h> > +#include <asm/machine_kexec.h> > + > + .text > + .align PAGE_SIZE > + .code64 > + > +ENTRY(kexec_reloc) > + /* %rdi - code page maddr */ > + /* %rsi - page table maddr */ > + /* %rdx - indirection page maddr */ > + /* %rcx - entry maddr */ > + /* %r8 - flags */ > + > + movq %rdx, %rbx > + > + /* Setup stack. */ > + leaq (reloc_stack - kexec_reloc)(%rdi), %rsp > + > + /* Load reloc page table. */ > + movq %rsi, %cr3 > + > + /* Jump to identity mapped code. */ > + leaq (identity_mapped - kexec_reloc)(%rdi), %rax > + jmpq *%rax > + > +identity_mapped: > + pushq %rcx > + pushq %rbx > + pushq %rsi > + pushq %rdi > + > + /* > + * Set cr0 to a known state: > + * - Paging enabled > + * - Alignment check disabled > + * - Write protect disabled > + * - No task switch > + * - Don''t do FP software emulation. > + * - Proctected mode enabled > + */ > + movq %cr0, %rax > + andq $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %rax > + orl $(X86_CR0_PG | X86_CR0_PE), %eaxEither "andq" and "orq" or "andl" and "orl".> + movq %rax, %cr0 > + > + /* > + * Set cr4 to a known state: > + * - physical address extension enabled > + */ > + movq $X86_CR4_PAE, %rax"movl" suffices here.> + movq %rax, %cr4 > + > + movq %rbx, %rdi > + call relocate_pages > + > + popq %rdi > + popq %rsi > + popq %rbx > + popq %rcx > + > + /* Need to switch to 32-bit mode? */ > + testq $KEXEC_RELOC_FLAG_COMPAT, %r8 > + jnz call_32_bit > + > +call_64_bit: > + /* Call the image entry point. This should never return. */ > + call *%rcx > + ud2 > + > +call_32_bit: > + /* Setup IDT. */ > + lidt compat_mode_idt(%rip) > + > + /* Load compat GDT. */ > + leaq (compat_mode_gdt - kexec_reloc)(%rdi), %rax > + movq %rax, (compat_mode_gdt_desc + 2)(%rip) > + lgdt compat_mode_gdt_desc(%rip) > + > + /* Relocate compatibility mode entry point address. */ > + leal (compatibility_mode - kexec_reloc)(%edi), %eax > + movl %eax, compatibility_mode_far(%rip) > + > + /* Enter compatibility mode. */ > + ljmp *compatibility_mode_far(%rip)As you''re elsewhere using mnemonic suffixes elsewhere even when not strictly needed, for consistency I''d recommend using one on this instruction (and perhaps also on the lidt/lgdt above) too.> + > +relocate_pages: > + /* %rdi - indirection page maddr */ > + cld > + movq %rdi, %rcx > + xorq %rdi, %rdi > + xorq %rsi, %rsiI guess performance and code size aren''t of highest importance here, but "xorl" would suffice in both of the above lines.> + jmp 1f > + > +0: /* top, read another word for the indirection page */ > + > + movq (%rbx), %rcx > + addq $8, %rbx > +1: > + testq $0x1, %rcx /* is it a destination page? */And "testl" (or even "testb") would be sufficient here. There are more pointless uses of the "q" suffix further down. In any event the 0x1 here and the other flags tested for below will want to become manifest constants.> + jz 2f > + movq %rcx, %rdi > + andq $0xfffffffffffff000, %rdiThe number of "f"-s here being correct can''t be seen without actually counting them - either use a manifest constant like PAGE_MASK, or at least write "$~0xfff".> + jmp 0b > +2: > + testq $0x2, %rcx /* is it an indirection page? 
*/ > + jz 2f > + movq %rcx, %rbx > + andq $0xfffffffffffff000, %rbx > + jmp 0b > +2: > + testq $0x4, %rcx /* is it the done indicator? */ > + jz 2f > + jmp 3fJust a single (inverse) conditional branch please. And there are too many "2:" labels here in close succession.> +2: > + testq $0x8, %rcx /* is it the source indicator? */ > + jz 2f > + movq %rcx, %rsi /* For ever source page do a copy */"every"?> + andq $0xfffffffffffff000, %rsi > + movq $512, %rcx > + rep movsq > + jmp 0b > +2: > + testq $0x10, %rcx /* is it the zero indicator? */ > + jz 0b /* Ignore it otherwise */ > + movq $512, %rcx /* Zero the destination page. */ > + xorq %rax, %rax > + rep stosq > + jmp 0b > +3: > + ret > + > + .code32 > + > +compatibility_mode: > + /* Setup some sane segments. */ > + movl $0x0008, %eax > + movl %eax, %ds > + movl %eax, %es > + movl %eax, %fs > + movl %eax, %gs > + movl %eax, %ss > + > + movl %ecx, %ebp > + > + /* Disable paging and therefore leave 64 bit mode. */ > + movl %cr0, %eax > + andl $~X86_CR0_PG, %eax > + movl %eax, %cr0 > + > + /* Disable long mode */ > + movl $MSR_EFER, %ecx > + rdmsr > + andl $~EFER_LME, %eax > + wrmsr > + > + /* Clear cr4 to disable PAE. */ > + movl $0, %eaxxorl %eax, %eax> + movl %eax, %cr4 > + > + /* Call the image entry point. This should never return. */ > + call *%ebp > + ud2 > + > + .align 16Why? 4 is all you need.> +compatibility_mode_far: > + .long 0x00000000 /* set in call_32_bit above */ > + .word 0x0010 > + > + .align 16Even more so here. If you care for alignment, you want 2 mod 8 here.> +compat_mode_gdt_desc: > + .word (3*8)-1 > + .quad 0x0000000000000000 /* set in call_32_bit above */ > + > + .align 16And just 8 here.> +compat_mode_gdt: > + .quad 0x0000000000000000 /* null */ > + .quad 0x00cf92000000ffff /* 0x0008 ring 0 data */ > + .quad 0x00cf9a000000ffff /* 0x0010 ring 0 code, compatibility */ > + > +compat_mode_idt:compat_mode_idt_desc: And if you care for alignment, 2 mod 8 again.> + .word 0 /* limit */ > + .long 0 /* base */ > + > + /* > + * 16 words of stack are more than enough. > + */ > + .fill 16,8,0 > +reloc_stack:And now you don''t care for the stack being mis-aligned? Jan
Jan Beulich
2013-Jun-25 08:33 UTC
Re: [PATCH 09/10] x86: check kexec relocation code fits in a page
>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: > --- a/xen/arch/x86/xen.lds.S > +++ b/xen/arch/x86/xen.lds.S > @@ -186,3 +186,7 @@ SECTIONS > .stab.indexstr 0 : { *(.stab.indexstr) } > .comment 0 : { *(.comment) } > } > + > +#ifdef CONFIG_KEXEC > +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large") > +#endifI don''t recall having seen a mechanism to disable CONFIG_KEXEC, so why the conditional? Jan
Andrew Cooper
2013-Jun-25 09:31 UTC
Re: [PATCH 09/10] x86: check kexec relocation code fits in a page
On 25/06/13 09:33, Jan Beulich wrote:>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: >> --- a/xen/arch/x86/xen.lds.S >> +++ b/xen/arch/x86/xen.lds.S >> @@ -186,3 +186,7 @@ SECTIONS >> .stab.indexstr 0 : { *(.stab.indexstr) } >> .comment 0 : { *(.comment) } >> } >> + >> +#ifdef CONFIG_KEXEC >> +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large") >> +#endif > I don''t recall having seen a mechanism to disable CONFIG_KEXEC, so > why the conditional? > > JanCONFIG_KEXEC exists in include/asm-x86/config.h, but it turns out not to compile if you disable it. I for one would not mind in the slightest if CONFIG_KEXEC disappeared. ~Andrew> > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
David Vrabel
2013-Jun-25 09:42 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
On 25/06/13 08:42, Jan Beulich wrote:>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: >> --- a/xen/include/public/arch-x86/xen-x86_32.h >> +++ b/xen/include/public/arch-x86/xen-x86_32.h >> @@ -91,8 +91,7 @@ >> #define machine_to_phys_mapping ((unsigned long *)MACH2PHYS_VIRT_START) >> #endif >> >> -/* 32-/64-bit invariability for control interfaces (domctl/sysctl). */ >> -#if defined(__XEN__) || defined(__XEN_TOOLS__) >> +/* 32-/64-bit invariability. */ >> #undef ___DEFINE_XEN_GUEST_HANDLE >> #define ___DEFINE_XEN_GUEST_HANDLE(name, type) \ >> typedef struct { type *p; } \ >> @@ -107,7 +106,6 @@ >> #define uint64_aligned_t uint64_t __attribute__((aligned(8))) > > This line is the reason why such a change is not acceptable: We > require the headers to not use gcc extensions outside of regions > guarded by dependencies on __XEN__ and/or __XEN_TOOLS__ (which > we know/require will always be built by gcc compatible tool chains).I did this because this is identical to what ARM is doing. I think we do what a guest handle type that is always 64 bits long. For x86, perhaps something like (but with a better name): #define ___DEFINE_XEN_GUEST_HANDLE(name, type) \ typedef struct { type *p; } \ __guest_handle_ ## name; \ #if defined(__XEN__) || (__XEN_TOOLS__) typedef struct { union { type *p; uint64_aligned_t q; }; } \ __guest_handle_64_ ## name \ #endif typedef struct { union { type *p; uint64_t q; }; } \ __guest_handle_new_ ## name #undef set_xen_guest_handle_raw #define set_xen_guest_handle_raw(hnd, val) \ do { if ( sizeof(hnd) == 8 ) *(uint64_t *)&(hnd) = 0; \ (hnd).p = val; \ } while ( 0 ) #if defined(__XEN__) || (__XEN_TOOLS__) #define uint64_aligned_t uint64_t __attribute__((aligned(8))) #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name) #endif #define __XEN_GUEST_HANDLE_NEW(name) __guest_handle_new_ ## name /* This must be aligned to 8 bytes with padding if necessary. */ #define XEN_GUEST_HANDLE_NEW(name) __XEN_GUEST_HANDLE_NEW(name)>> #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name >> #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name) >> -#endif >> >> #ifndef __ASSEMBLY__ > > I''m afraid you''ll need to find a way to do what you want in the > kexec interface with the traditional manual padding approach.This is fine. The kexec interface has the necessary padding and doesn''t need the the aligned attribute. David
Jan Beulich
2013-Jun-25 11:36 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
>>> On 25.06.13 at 11:42, David Vrabel <david.vrabel@citrix.com> wrote: > On 25/06/13 08:42, Jan Beulich wrote: >>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: >>> #define uint64_aligned_t uint64_t __attribute__((aligned(8))) >> >> This line is the reason why such a change is not acceptable: We >> require the headers to not use gcc extensions outside of regions >> guarded by dependencies on __XEN__ and/or __XEN_TOOLS__ (which >> we know/require will always be built by gcc compatible tool chains). > > I did this because this is identical to what ARM is doing. > > I think we do what a guest handle type that is always 64 bits long. For > x86, perhaps something like (but with a better name): > > #define ___DEFINE_XEN_GUEST_HANDLE(name, type) \ > typedef struct { type *p; } \ > __guest_handle_ ## name; \ > #if defined(__XEN__) || (__XEN_TOOLS__) > typedef struct { union { type *p; uint64_aligned_t q; }; } \ > __guest_handle_64_ ## name \ > #endif > typedef struct { union { type *p; uint64_t q; }; } \ > __guest_handle_new_ ## nameThe uint64_t here ...> #undef set_xen_guest_handle_raw > #define set_xen_guest_handle_raw(hnd, val) \ > do { if ( sizeof(hnd) == 8 ) *(uint64_t *)&(hnd) = 0; \ > (hnd).p = val; \ > } while ( 0 ) > > #if defined(__XEN__) || (__XEN_TOOLS__) > #define uint64_aligned_t uint64_t __attribute__((aligned(8))) > #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name > #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name) > #endif > > #define __XEN_GUEST_HANDLE_NEW(name) __guest_handle_new_ ## name > /* This must be aligned to 8 bytes with padding if necessary. */ > #define XEN_GUEST_HANDLE_NEW(name) __XEN_GUEST_HANDLE_NEW(name)... does in no way satisfy the comment here, so what''s the point?>> I''m afraid you''ll need to find a way to do what you want in the >> kexec interface with the traditional manual padding approach. > > This is fine. The kexec interface has the necessary padding and doesn''t > need the the aligned attribute.Not afaict, unless you meant if substituting XEN_GUEST_HANDLE_NEW() (rather than XEN_GUEST_HANDLE()) for XEN_GUEST_HANDLE_64(). Jan
Jan Beulich
2013-Jun-25 11:38 UTC
Re: [PATCH 09/10] x86: check kexec relocation code fits in a page
>>> On 25.06.13 at 11:31, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 25/06/13 09:33, Jan Beulich wrote: >>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: >>> --- a/xen/arch/x86/xen.lds.S >>> +++ b/xen/arch/x86/xen.lds.S >>> @@ -186,3 +186,7 @@ SECTIONS >>> .stab.indexstr 0 : { *(.stab.indexstr) } >>> .comment 0 : { *(.comment) } >>> } >>> + >>> +#ifdef CONFIG_KEXEC >>> +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large") >>> +#endif >> I don''t recall having seen a mechanism to disable CONFIG_KEXEC, so >> why the conditional? > > CONFIG_KEXEC exists in include/asm-x86/config.h, but it turns out not to > compile if you disable it.This is more of an announcement than a knob for disabling (such that e.g. generic code can exclude respective pieces from getting built).> I for one would not mind in the slightest if CONFIG_KEXEC disappeared.We should keep it at least as long as ARM doesn''t support it, and perhaps even after to be prepared for new ports that (initially) don''t have the necessary support bits. Jan
David Vrabel
2013-Jun-25 13:17 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
On 25/06/13 12:36, Jan Beulich wrote:>>>> On 25.06.13 at 11:42, David Vrabel <david.vrabel@citrix.com> wrote: >> On 25/06/13 08:42, Jan Beulich wrote: >>>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote: >>>> #define uint64_aligned_t uint64_t __attribute__((aligned(8))) >>> >>> This line is the reason why such a change is not acceptable: We >>> require the headers to not use gcc extensions outside of regions >>> guarded by dependencies on __XEN__ and/or __XEN_TOOLS__ (which >>> we know/require will always be built by gcc compatible tool chains). >> >> I did this because this is identical to what ARM is doing. >> >> I think we do what a guest handle type that is always 64 bits long. For >> x86, perhaps something like (but with a better name): >> >> #define ___DEFINE_XEN_GUEST_HANDLE(name, type) \ >> typedef struct { type *p; } \ >> __guest_handle_ ## name; \ >> #if defined(__XEN__) || (__XEN_TOOLS__) >> typedef struct { union { type *p; uint64_aligned_t q; }; } \ >> __guest_handle_64_ ## name \ >> #endif >> typedef struct { union { type *p; uint64_t q; }; } \ >> __guest_handle_new_ ## name > > The uint64_t here ... > >> #undef set_xen_guest_handle_raw >> #define set_xen_guest_handle_raw(hnd, val) \ >> do { if ( sizeof(hnd) == 8 ) *(uint64_t *)&(hnd) = 0; \ >> (hnd).p = val; \ >> } while ( 0 ) >> >> #if defined(__XEN__) || (__XEN_TOOLS__) >> #define uint64_aligned_t uint64_t __attribute__((aligned(8))) >> #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name >> #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name) >> #endif >> >> #define __XEN_GUEST_HANDLE_NEW(name) __guest_handle_new_ ## name >> /* This must be aligned to 8 bytes with padding if necessary. */ >> #define XEN_GUEST_HANDLE_NEW(name) __XEN_GUEST_HANDLE_NEW(name) > > ... does in no way satisfy the comment here, so what''s the point?The comment is unclear, sorry. /* A structure containing this type of guest handle must align the field to 8 bytes, using padding fields as necessary. */>>> I''m afraid you''ll need to find a way to do what you want in the >>> kexec interface with the traditional manual padding approach. >> >> This is fine. The kexec interface has the necessary padding and doesn''t >> need the the aligned attribute. > > Not afaict, unless you meant if substituting XEN_GUEST_HANDLE_NEW() > (rather than XEN_GUEST_HANDLE()) for XEN_GUEST_HANDLE_64().Yes. David
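As a concrete illustration of that comment (structure and field names invented, not part of the series): the proposed handle is always 8 bytes wide, but without the aligned attribute a 32-bit compiler may give it only 4-byte alignment, so a structure using it has to keep it at an 8-byte offset with explicit padding.

/* Invented example: explicit padding keeps 'items' at offset 8 for both
 * 32-bit and 64-bit guests, so the layouts match without compat handling. */
struct example_op {
    uint32_t nr_items;
    uint32_t _pad;
    XEN_GUEST_HANDLE_NEW(void) items;  /* 8 bytes on both ABIs */
    uint64_t flags;
};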
Jan Beulich
2013-Jun-25 13:53 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
>>> On 25.06.13 at 15:17, David Vrabel <david.vrabel@citrix.com> wrote:
> On 25/06/13 12:36, Jan Beulich wrote:
>>>>> On 25.06.13 at 11:42, David Vrabel <david.vrabel@citrix.com> wrote:
>>> typedef struct { union { type *p; uint64_t q; }; } \
>>>     __guest_handle_new_ ## name
>>
>> The uint64_t here ...
>>
>>> #undef set_xen_guest_handle_raw
>>> #define set_xen_guest_handle_raw(hnd, val) \
>>>     do { if ( sizeof(hnd) == 8 ) *(uint64_t *)&(hnd) = 0; \
>>>          (hnd).p = val; \
>>>     } while ( 0 )
>>>
>>> #if defined(__XEN__) || (__XEN_TOOLS__)
>>> #define uint64_aligned_t uint64_t __attribute__((aligned(8)))
>>> #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name
>>> #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name)
>>> #endif
>>>
>>> #define __XEN_GUEST_HANDLE_NEW(name) __guest_handle_new_ ## name
>>> /* This must be aligned to 8 bytes with padding if necessary. */
>>> #define XEN_GUEST_HANDLE_NEW(name) __XEN_GUEST_HANDLE_NEW(name)
>>
>> ... does in no way satisfy the comment here, so what's the point?
>
> The comment is unclear, sorry.
>
> /* A structure containing this type of guest handle must align the
>    field to 8 bytes, using padding fields as necessary. */

Okay, now I understand your intention at least. However, I
still don't see a point in doing what you try to do - the consumer
still has to add stuff along with the (oddly named) new handle
type, so why not having it take care of padding _and_ sizing?

Jan
David Vrabel
2013-Jun-25 14:30 UTC
Re: [PATCH 05/10] kexec: extend hypercall with improved load/unload ops
On 25/06/13 09:31, Jan Beulich wrote:
>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote:
>>
>> +static int init_transition_pgtable(struct kexec_image *image, l4_pgentry_t *l4)
[...]
>> +    l3 = __map_domain_page(l3_page);
>> +    l3 += l3_table_offset(vaddr);
>> +    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
>> +    {
>> +        l2_page = kimage_alloc_control_page(image, 0);
>> +        if ( !l2_page )
>> +            goto out;
>> +        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
>> +    }
>> +    else
>> +        l2_page = l3e_get_page(*l3);
>
> Afaict you're done using "l3" here, so you should unmap it in order
> to reduce the pressure on the domain page mapping resources.

The unmaps are grouped at the end to make the error paths simpler and I
would prefer to keep it like this. This is only using 4 entries. Are
we really that short?

>> +static int build_reloc_page_table(struct kexec_image *image)
>> +{
>> +    struct page_info *l4_page;
>> +    l4_pgentry_t *l4;
>> +    int result;
>> +
>> +    l4_page = kimage_alloc_control_page(image, 0);
>> +    if ( !l4_page )
>> +        return -ENOMEM;
>> +    l4 = __map_domain_page(l4_page);
>> +
>> +    result = init_level4_page(image, l4, 0, max_page << PAGE_SHIFT);
>
> What about holes in the physical address space - not just the
> MMIO hole below 4Gb is a problem here, but also discontiguous
> physical memory.

I don't see a problem with creating mappings for non-RAM regions.

The discontiguous physical memory is a problem though. I think I'll
solve this by specifying that images are only executed with the first
4 GiB of physical address space linearly mapped. If this turns out not
to be enough then the mappings can be extended without breaking
existing tools or images.

>> --- /dev/null
>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
[...]
> And just 8 here.

I seem to recall reading that some processors needed 16 byte alignment
for the GDT. I may be misremembering or this was for an older processor
that Xen no longer supports.

>> +compat_mode_gdt:
>> +        .quad 0x0000000000000000 /* null */
>> +        .quad 0x00cf92000000ffff /* 0x0008 ring 0 data */
>> +        .quad 0x00cf9a000000ffff /* 0x0010 ring 0 code, compatibility */
[...]
>> +        /*
>> +         * 16 words of stack are more than enough.
>> +         */
>> +        .fill 16,8,0
>> +reloc_stack:
>
> And now you don't care for the stack being mis-aligned?

I do find the way you make some review comments as a question like this
rather ambiguous. I guess I don't care? But now I'm not sure if I should.

David
David Vrabel
2013-Jun-25 14:48 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
On 25/06/13 14:53, Jan Beulich wrote:
>>>> On 25.06.13 at 15:17, David Vrabel <david.vrabel@citrix.com> wrote:
>> On 25/06/13 12:36, Jan Beulich wrote:
>>>>>> On 25.06.13 at 11:42, David Vrabel <david.vrabel@citrix.com> wrote:
>>>> typedef struct { union { type *p; uint64_t q; }; } \
>>>>     __guest_handle_new_ ## name
>>>
>>> The uint64_t here ...
>>>
>>>> #undef set_xen_guest_handle_raw
>>>> #define set_xen_guest_handle_raw(hnd, val) \
>>>>     do { if ( sizeof(hnd) == 8 ) *(uint64_t *)&(hnd) = 0; \
>>>>          (hnd).p = val; \
>>>>     } while ( 0 )
>>>>
>>>> #if defined(__XEN__) || (__XEN_TOOLS__)
>>>> #define uint64_aligned_t uint64_t __attribute__((aligned(8)))
>>>> #define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name
>>>> #define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name)
>>>> #endif
>>>>
>>>> #define __XEN_GUEST_HANDLE_NEW(name) __guest_handle_new_ ## name
>>>> /* This must be aligned to 8 bytes with padding if necessary. */
>>>> #define XEN_GUEST_HANDLE_NEW(name) __XEN_GUEST_HANDLE_NEW(name)
>>>
>>> ... does in no way satisfy the comment here, so what's the point?
>>
>> The comment is unclear, sorry.
>>
>> /* A structure containing this type of guest handle must align the
>>    field to 8 bytes, using padding fields as necessary. */
>
> Okay, now I understand your intention at least. However, I
> still don't see a point in doing what you try to do - the consumer
> still has to add stuff along with the (oddly named) new handle
> type, so why not having it take care of padding _and_ sizing?

I want the structure to be identical for 32-bit and 64-bit guests so
compat code is not required in the hypervisor.

Without a macro like XEN_GUEST_HANDLE_64() how do you suggest the
structure is defined so the field is sized correctly (i.e., 8 bytes)?

Are you suggesting something like this?

typedef struct xen_kexec_load {
    uint8_t type;        /* One of KEXEC_TYPE_* */
    uint8_t _pad;
    uint16_t arch;       /* ELF machine type (EM_*). */
    uint32_t nr_segments;
    union {
        XEN_GUEST_HANDLE(xen_kexec_segment_t) segments;
        uint64_t _qword;
    } u;
    uint64_t entry_maddr; /* image entry point machine address. */
} xen_kexec_load_t;

David
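The point of the union here is that the field occupies eight bytes for both
32-bit and 64-bit callers, so the structure layout is the same on either side
of the hypercall. A stand-alone sketch of how that property can be checked at
compile time (the struct name and the guest-handle stand-in below are
illustrative, not part of the series):

    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>

    typedef union { void *p; uint64_t q; } handle_stub; /* stand-in for the handle union */

    struct kexec_load_like {
        uint8_t     type;
        uint8_t     _pad;
        uint16_t    arch;
        uint32_t    nr_segments;
        handle_stub u;
        uint64_t    entry_maddr;
    };

    /* These hold whether the file is built -m32 or -m64, which is the
     * "no compat translation needed" property being discussed. */
    static_assert(offsetof(struct kexec_load_like, u) == 8, "handle at offset 8");
    static_assert(offsetof(struct kexec_load_like, entry_maddr) == 16, "entry_maddr at offset 16");
    static_assert(sizeof(struct kexec_load_like) == 24, "24 bytes on both ABIs");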
Jan Beulich
2013-Jun-25 14:59 UTC
Re: [PATCH 05/10] kexec: extend hypercall with improved load/unload ops
>>> On 25.06.13 at 16:30, David Vrabel <david.vrabel@citrix.com> wrote:
> On 25/06/13 09:31, Jan Beulich wrote:
>>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote:
>>>
>>> +static int init_transition_pgtable(struct kexec_image *image, l4_pgentry_t *l4)
> [...]
>>> +    l3 = __map_domain_page(l3_page);
>>> +    l3 += l3_table_offset(vaddr);
>>> +    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
>>> +    {
>>> +        l2_page = kimage_alloc_control_page(image, 0);
>>> +        if ( !l2_page )
>>> +            goto out;
>>> +        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
>>> +    }
>>> +    else
>>> +        l2_page = l3e_get_page(*l3);
>>
>> Afaict you're done using "l3" here, so you should unmap it in order
>> to reduce the pressure on the domain page mapping resources.
>
> The unmaps are grouped at the end to make the error paths simpler and I
> would prefer to keep it like this. This is only using 4 entries. Are
> we really that short?

4 entries are fine as long as calling code doesn't also (now or in
the future) want to keep stuff mapped around calling this.

>>> +static int build_reloc_page_table(struct kexec_image *image)
>>> +{
>>> +    struct page_info *l4_page;
>>> +    l4_pgentry_t *l4;
>>> +    int result;
>>> +
>>> +    l4_page = kimage_alloc_control_page(image, 0);
>>> +    if ( !l4_page )
>>> +        return -ENOMEM;
>>> +    l4 = __map_domain_page(l4_page);
>>> +
>>> +    result = init_level4_page(image, l4, 0, max_page << PAGE_SHIFT);
>>
>> What about holes in the physical address space - not just the
>> MMIO hole below 4Gb is a problem here, but also discontiguous
>> physical memory.
>
> I don't see a problem with creating mappings for non-RAM regions.

You absolutely must not map regions you don't know anything about
with WB attribute, or else side effects of prefetches can create
very hard to debug issues.

>>> --- /dev/null
>>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> [...]
>> And just 8 here.
>
> I seem to recall reading that some processors needed 16 byte alignment
> for the GDT. I may be misremembering or this was for an older processor
> that Xen no longer supports.

I'm unaware of such, and e.g. trampoline_gdt is currently not even
8-byte aligned. But if you can point to documentation saying so, I
certainly won't object playing by their rules.

>>> +compat_mode_gdt:
>>> +        .quad 0x0000000000000000 /* null */
>>> +        .quad 0x00cf92000000ffff /* 0x0008 ring 0 data */
>>> +        .quad 0x00cf9a000000ffff /* 0x0010 ring 0 code, compatibility */
> [...]
>>> +        /*
>>> +         * 16 words of stack are more than enough.
>>> +         */
>>> +        .fill 16,8,0
>>> +reloc_stack:
>>
>> And now you don't care for the stack being mis-aligned?
>
> I do find the way you make some review comments as a question like this
> rather ambiguous. I guess I don't care? But now I'm not sure if I should.

I was just puzzled by the earlier over-aligning and the complete lack
of alignment here.

Jan
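For reference, the WB concern can be addressed by consulting the e820-derived
RAM map while building the identity mapping. A rough sketch follows;
page_is_ram_type() and RAM_TYPE_CONVENTIONAL are assumed to be the helpers
from Xen's x86 headers, and map_one_page() is a hypothetical stand-in for the
page-table entry writes actually done by init_level4_page():

    /* Sketch only: skip frames the E820 map does not report as conventional
     * RAM, so nothing unknown ends up mapped with the WB memory type. */
    static int map_ram_only(struct kexec_image *image, l4_pgentry_t *l4,
                            paddr_t start, paddr_t end)
    {
        paddr_t addr;

        for ( addr = start; addr < end; addr += PAGE_SIZE )
        {
            if ( !page_is_ram_type(paddr_to_pfn(addr), RAM_TYPE_CONVENTIONAL) )
                continue;                         /* leave holes/MMIO unmapped */
            if ( map_one_page(image, l4, addr) )  /* hypothetical helper */
                return -ENOMEM;
        }

        return 0;
    }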
Jan Beulich
2013-Jun-25 15:02 UTC
Re: [PATCH 02/10] xen: make GUEST_HANDLE_64() and uint64_aligned_t available everywhere
>>> On 25.06.13 at 16:48, David Vrabel <david.vrabel@citrix.com> wrote:
> I want the structure to be identical for 32-bit and 64-bit guests so
> compat code is not required in the hypervisor.
>
> Without a macro like XEN_GUEST_HANDLE_64() how do you suggest the
> structure is defined so the field is sized correctly (i.e., 8 bytes)?
>
> Are you suggesting something like this?
>
> typedef struct xen_kexec_load {
>     uint8_t type;        /* One of KEXEC_TYPE_* */
>     uint8_t _pad;
>     uint16_t arch;       /* ELF machine type (EM_*). */
>     uint32_t nr_segments;
>     union {
>         XEN_GUEST_HANDLE(xen_kexec_segment_t) segments;
>         uint64_t _qword;
>     } u;
>     uint64_t entry_maddr; /* image entry point machine address. */
> } xen_kexec_load_t;

Yes - there are already other cases similar to this in the public
headers, for example struct vcpu_register_runstate_memory_area and
struct vcpu_register_time_memory_area.

Jan
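The existing examples Jan points to use a union of a guest handle and a
uint64_t so that the field is eight bytes wide for every guest ABI; roughly
(quoted from memory from xen/include/public/vcpu.h, so treat the exact field
names as approximate):

    struct vcpu_register_runstate_memory_area {
        union {
            XEN_GUEST_HANDLE(vcpu_runstate_info_t) h;
            struct vcpu_runstate_info *v;
            uint64_t p; /* 64 bits wide for every ABI, so no compat handling */
        } addr;
    };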
Ian Campbell
2013-Jun-25 16:38 UTC
Re: [PATCH 09/10] x86: check kexec relocation code fits in a page
On Tue, 2013-06-25 at 12:38 +0100, Jan Beulich wrote:
> >>> On 25.06.13 at 11:31, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > On 25/06/13 09:33, Jan Beulich wrote:
> >>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote:
> >>> --- a/xen/arch/x86/xen.lds.S
> >>> +++ b/xen/arch/x86/xen.lds.S
> >>> @@ -186,3 +186,7 @@ SECTIONS
> >>>    .stab.indexstr 0 : { *(.stab.indexstr) }
> >>>    .comment 0 : { *(.comment) }
> >>>  }
> >>> +
> >>> +#ifdef CONFIG_KEXEC
> >>> +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large")
> >>> +#endif
> >>
> >> I don't recall having seen a mechanism to disable CONFIG_KEXEC, so
> >> why the conditional?
> >
> > CONFIG_KEXEC exists in include/asm-x86/config.h, but it turns out not to
> > compile if you disable it.
>
> This is more of an announcement than a knob for disabling (such
> that e.g. generic code can exclude respective pieces from getting
> built).

It would have been better to call these things HAVE_FOO rather than
CONFIG_FOO to avoid the implication of configurability, but that horse is
long gone...

> > I for one would not mind in the slightest if CONFIG_KEXEC disappeared.
>
> We should keep it at least as long as ARM doesn't support it, and
> perhaps even after to be prepared for new ports that (initially)
> don't have the necessary support bits.
>
> Jan
Daniel Kiper
2013-Jun-25 18:52 UTC
Re: [PATCH 05/10] kexec: extend hypercall with improved load/unload ops
On Mon, Jun 24, 2013 at 06:42:16PM +0100, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> In the existing kexec hypercall, the load and unload ops depend on
> internals of the Linux kernel (the page list and code page provided by
> the kernel). The code page is used to transition between Xen context
> and the image so using kernel code doesn't make sense and will not
> work for PVH guests.
>
> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
> that no longer require a code page to be provided by the guest -- Xen
> now provides the code for calling the image directly.
>
> The new load op looks similar to the Linux kexec_load system call and
> allows the guest to provide the image data to be loaded. The guest
> specifies the architecture of the image which may be a 32-bit subarch
> of the hypervisor's architecture (i.e., an EM_386 image on an
> EM_X86_64 hypervisor).
>
> The toolstack can now load images without kernel involvement. This is
> required for supporting kexec when using a dom0 with an upstream
> kernel.
>
> Crash images are copied directly into the crash region on load.
> Default images are copied into domheap pages and a list of source and
> destination machine addresses is created. This list is used in
> kexec_reloc() to relocate the image to its destination.
>
> The old load and unload sub-ops are still available (as
> KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
> of the new infrastructure.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>

[...]

> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
> new file mode 100644
> index 0000000..135cbcd
> --- /dev/null
> +++ b/xen/arch/x86/x86_64/kexec_reloc.S

[...]

> +        .globl __kexec_reloc_size
> +        .set __kexec_reloc_size, . - kexec_reloc
> +        .globl kexec_reloc_size
> +kexec_reloc_size:
> +        .quad __kexec_reloc_size

Why do you define two variables to store the same value? Why quad not long?

I think that you could do this:

    .globl kexec_reloc_size    /* Personally, I prefer to do this at the beginning of the .S file. */
kexec_reloc_size:
    .long . - xen_relocate_kernel

It should work for C files and the linker script.

Daniel
Daniel Kiper
2013-Jun-25 19:00 UTC
Re: [PATCH 09/10] x86: check kexec relocation code fits in a page
On Mon, Jun 24, 2013 at 06:42:20PM +0100, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> The kexec relocation (control) code must fit in a single page so add a
> link time check for this.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/arch/x86/xen.lds.S |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
> index d959941..eebed01 100644
> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -186,3 +186,7 @@ SECTIONS
>    .stab.indexstr 0 : { *(.stab.indexstr) }
>    .comment 0 : { *(.comment) }
>  }
> +
> +#ifdef CONFIG_KEXEC
> +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large")

ASSERT(kexec_reloc_size <= KEXEC_CONTROL_PAGE_SIZE, "kexec control code is too large")

Daniel
Daniel Kiper
2013-Jun-25 19:27 UTC
Re: [PATCHv6 0/10] kexec: extend kexec hypercall for use with pv-ops kernels
On Mon, Jun 24, 2013 at 06:42:11PM +0100, David Vrabel wrote:

[...]

> Changes since v4 (v5 was not posted to the list):
>
> - _rsvd -> _pad in one of the public ABI structures.
> - Fix bug where trailing pages were not zeroed. This fixes loading a
>   64-bit Linux kernel using a more recent version of kexec-tools.

Why? I do not see why trailing pages must be zeroed. I am afraid that
this way you are only masking a bug in kexec-tools. I think it is better
to do a bisect on it and find out which patch introduces a bug.

In general this series is OK, but as I can see Jan has some comments this
time. I think that it is worth taking most of them into account.

Please prepare the next version of these patches and repost them with
the kexec-tools patches. They are an integral part of the new Xen kexec
implementation and it is worth reviewing both patch series together.

Daniel

PS I will be on holiday next week and will not be able to do reviews.
David Vrabel
2013-Jun-26 09:44 UTC
Re: [PATCHv6 0/10] kexec: extend kexec hypercall for use with pv-ops kernels
On 25/06/13 20:27, Daniel Kiper wrote:
> On Mon, Jun 24, 2013 at 06:42:11PM +0100, David Vrabel wrote:
>
> [...]
>
>> Changes since v4 (v5 was not posted to the list):
>>
>> - _rsvd -> _pad in one of the public ABI structures.
>> - Fix bug where trailing pages were not zeroed. This fixes loading a
>>   64-bit Linux kernel using a more recent version of kexec-tools.
>
> Why? I do not see why trailing pages must be zeroed. I am afraid that
> this way you are only masking a bug in kexec-tools. I think it is better
> to do a bisect on it and find out which patch introduces a bug.

The load sub-op is defined such that the range (dest_maddr + buf_size,
dest_maddr + dest_size] is zeroed. If the caller asks for lots of
trailing zeros then we should do this (regardless of how pointless this
is). I do note that this is not very well documented and I will improve
this.

kexec-tools calculates the checksum of the image including the trailing
zeroed region which is why the exec would fail. If you want to improve
kexec-tools to load fewer zeroed pages, be my guest.

> Please prepare the next version of these patches and repost them with
> the kexec-tools patches. They are an integral part of the new Xen kexec
> implementation and it is worth reviewing both patch series together.

The kexec-tools patches haven't changed and I don't see any merit in
reposting them until the Xen patches are applied.

David
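In other words, the contract being described is: copy buf_size bytes of image
data to the destination, then zero the remainder of the dest_size-byte region.
A minimal illustration (the function and parameter names are illustrative,
not the series' actual code):

    #include <string.h>

    /* Illustration only: dest is the (mapped) destination segment.
     * Caller guarantees buf_size <= dest_size. */
    static void fill_segment(void *dest, const void *buf,
                             size_t buf_size, size_t dest_size)
    {
        memcpy(dest, buf, buf_size);                               /* image data     */
        memset((char *)dest + buf_size, 0, dest_size - buf_size); /* trailing zeros */
    }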
David Vrabel
2013-Jun-26 09:50 UTC
Re: [PATCH 09/10] x86: check kexec relocation code fits in a page
On 25/06/13 20:00, Daniel Kiper wrote:
> On Mon, Jun 24, 2013 at 06:42:20PM +0100, David Vrabel wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> The kexec relocation (control) code must fit in a single page so add a
>> link time check for this.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  xen/arch/x86/xen.lds.S |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
>> index d959941..eebed01 100644
>> --- a/xen/arch/x86/xen.lds.S
>> +++ b/xen/arch/x86/xen.lds.S
>> @@ -186,3 +186,7 @@ SECTIONS
>>    .stab.indexstr 0 : { *(.stab.indexstr) }
>>    .comment 0 : { *(.comment) }
>>  }
>> +
>> +#ifdef CONFIG_KEXEC
>> +ASSERT(__kexec_reloc_size <= PAGE_SIZE, "kexec control code is too large")
>
> ASSERT(kexec_reloc_size <= KEXEC_CONTROL_PAGE_SIZE, "kexec control code is too large")

Huh. I thought I'd removed KEXEC_CONTROL_PAGE_SIZE but I see the
#define is still there.

David
Jan Beulich
2013-Jun-26 09:52 UTC
Re: [PATCHv6 0/10] kexec: extend kexec hypercall for use with pv-ops kernels
>>> On 26.06.13 at 11:44, David Vrabel <david.vrabel@citrix.com> wrote:
> On 25/06/13 20:27, Daniel Kiper wrote:
>> Please prepare the next version of these patches and repost them with
>> the kexec-tools patches. They are an integral part of the new Xen kexec
>> implementation and it is worth reviewing both patch series together.
>
> The kexec-tools patches haven't changed and I don't see any merit in
> reposting them until the Xen patches are applied.

So what if we apply the Xen patches, and then you're asked to do
changes to the tools ones, requiring adjustments to the interface?
Or if the tools side patches get rejected altogether? There certainly
ought to be some mutual agreement that both will get applied in a
certain (final) shape...

Jan
David Vrabel
2013-Jun-27 17:17 UTC
Re: [PATCH 04/10] kexec: add infrastructure for handling kexec images
On 25/06/13 08:54, Jan Beulich wrote:
>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote:
>>
>> +    goto out;
>> +
>> +    image->control_page = ~0; /* By default this does not apply */
>
> Use a recognizable #define instead?

This assignment serves no purpose so I'll leave it zeroed.

>> +    mstart = image->segments[i].dest_maddr;
>> +    mend = mstart + image->segments[i].dest_size;
>> +    if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
>
> Expressions like this can be abbreviated to
>
> if ( (mstart | mend) & ~PAGE_MASK )

I think the original more clearly reflects the intent.

David
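For reference, the two forms discussed above accept and reject exactly the
same values: OR-ing the two addresses leaves a bit set below the page
boundary precisely when at least one of them is not page aligned. A
self-contained comparison:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096UL
    #define PAGE_MASK (~(PAGE_SIZE - 1))

    /* The form used in the patch. */
    static bool misaligned_verbose(uint64_t mstart, uint64_t mend)
    {
        return (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK);
    }

    /* The abbreviated form suggested in review; equivalent result. */
    static bool misaligned_compact(uint64_t mstart, uint64_t mend)
    {
        return ((mstart | mend) & ~PAGE_MASK) != 0;
    }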
David Vrabel
2013-Jun-27 17:29 UTC
Re: [PATCH 03/10] kexec: add public interface for improved load/unload sub-ops
On 25/06/13 08:45, Jan Beulich wrote:
>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote:
>> @@ -152,6 +152,64 @@ typedef struct xen_kexec_range {
>>      unsigned long start;
>>  } xen_kexec_range_t;
>>
>> +#if __XEN_INTERFACE_VERSION__ >= 0x00040300
>
> ... >= 0x00040400

Yes, as soon as Xen's version becomes 4.4, but for now it's still 4.3.

David
David Vrabel
2013-Jun-27 17:39 UTC
Re: [PATCH 05/10] kexec: extend hypercall with improved load/unload ops
On 25/06/13 19:52, Daniel Kiper wrote:
> On Mon, Jun 24, 2013 at 06:42:16PM +0100, David Vrabel wrote:
>>
>> +        .globl __kexec_reloc_size
>> +        .set __kexec_reloc_size, . - kexec_reloc
>> +        .globl kexec_reloc_size
>> +kexec_reloc_size:
>> +        .quad __kexec_reloc_size
>
> Why do you define two variables to store the same value?
>
>     .globl kexec_reloc_size    /* Personally, I prefer to do this at the beginning of the .S file. */
> kexec_reloc_size:
>     .long . - xen_relocate_kernel
>
> It should work for C files and the linker script.

This doesn't work; to the linker, the value of kexec_reloc_size is its
address.

David
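The distinction being made: __kexec_reloc_size is an absolute symbol (from
.set) whose value is the size, which is what a linker-script ASSERT() can
test, while kexec_reloc_size labels a data word whose contents hold the size,
which is what C code can read. A sketch of how each side might consume them
(assumed consumers, not code from the series):

    #include <string.h>

    extern const char kexec_reloc[];        /* start of the relocation code        */
    extern unsigned long kexec_reloc_size;  /* data word: its *contents* = size    */
    extern const char __kexec_reloc_size[]; /* absolute symbol: its *value* = size,
                                             * i.e. (size_t)__kexec_reloc_size     */

    /* C side: read the word and copy the relocation code into place. */
    static void copy_reloc_code(void *control_page)
    {
        memcpy(control_page, kexec_reloc, kexec_reloc_size);
    }

    /* Linker-script side: only a symbol's value is visible there, so the
     * ASSERT() has to use __kexec_reloc_size; referencing kexec_reloc_size
     * would yield its address, as noted above. */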
Jan Beulich
2013-Jun-28 06:53 UTC
Re: [PATCH 03/10] kexec: add public interface for improved load/unload sub-ops
>>> On 27.06.13 at 19:29, David Vrabel <david.vrabel@citrix.com> wrote:
> On 25/06/13 08:45, Jan Beulich wrote:
>>>>> On 24.06.13 at 19:42, David Vrabel <david.vrabel@citrix.com> wrote:
>>> @@ -152,6 +152,64 @@ typedef struct xen_kexec_range {
>>>      unsigned long start;
>>>  } xen_kexec_range_t;
>>>
>>> +#if __XEN_INTERFACE_VERSION__ >= 0x00040300
>>
>> ... >= 0x00040400
>
> Yes, as soon as Xen's version becomes 4.4, but for now it's still 4.3.

Not exactly - if no patch before yours bumps the interface version,
your one will need to.

Jan
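For context, the guard only takes effect once the latest interface version is
raised, which happens in a single place; a sketch of the mechanism (not a
quote of the actual headers):

    /* xen/include/public/xen-compat.h (sketch): bumped once per release
     * that changes the public interface. */
    #define __XEN_LATEST_INTERFACE_VERSION__ 0x00040400

    /* xen/include/public/kexec.h (sketch): the new ABI is only visible to
     * callers built against 4.4 or later. */
    #if __XEN_INTERFACE_VERSION__ >= 0x00040400
    /* ... new xen_kexec_load / xen_kexec_segment structures ... */
    #else
    /* ... v1 compatibility structures ... */
    #endif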