Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 00/11] xen: Initial kexec/kdump implementation
Hi,
This patch set contains v3 of the initial kexec/kdump implementation for Xen.
Currently only dom0 is supported; however, almost all of the infrastructure
required for domU support is ready.
Jan Beulich suggested merging the Xen x86 assembler code with the baremetal x86
code. This could simplify the kernel code and reduce its size a bit. However,
this solution requires some changes in the baremetal x86 code. First of all,
the code which establishes the transition page table should be moved back from
machine_kexec_$(BITS).c to relocate_kernel_$(BITS).S. Another important thing
which should be changed in that case is the format of the page_list array. The
Xen kexec hypercall requires physical addresses to alternate with virtual ones.
These and other required changes have not been made in this version because
I am not sure that this solution will be accepted by the kexec/kdump
maintainers.
I hope that this email will spark a discussion about that topic.
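To make the page_list point more concrete, here is a minimal user-space sketch of an array in which every physical address is followed by the matching virtual address. The slot names (PL_PA_CONTROL_PAGE and so on), the helper page_list_set_pair() and the fake address translation are hypothetical and only model the alternating layout; they are not the real Xen or baremetal definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slot indices: each physical slot is followed by its virtual twin. */
enum {
	PL_PA_CONTROL_PAGE = 0,
	PL_VA_CONTROL_PAGE = 1,
	PL_PA_PGD          = 2,
	PL_VA_PGD          = 3,
	PL_MAX             = 4,
};

/* Stand-in for virt_to_phys(): the offset is made up for this sketch. */
#define FAKE_PAGE_OFFSET 0x1000000UL

static unsigned long fake_virt_to_phys(unsigned long vaddr)
{
	return vaddr - FAKE_PAGE_OFFSET;
}

/* Fill one physical/virtual pair; pa_index must be an even (physical) slot. */
static void page_list_set_pair(unsigned long *page_list, size_t pa_index,
			       unsigned long vaddr)
{
	assert(pa_index % 2 == 0 && pa_index + 1 < PL_MAX);
	page_list[pa_index]     = fake_virt_to_phys(vaddr);
	page_list[pa_index + 1] = vaddr;
}
```

With such a layout, code that runs before and after the page table switch can pick whichever address of a pair is valid at that moment.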
Daniel
 arch/x86/Kconfig                     |    3 +
 arch/x86/include/asm/kexec.h         |   10 +-
 arch/x86/include/asm/xen/hypercall.h |    6 +
 arch/x86/include/asm/xen/kexec.h     |   79 ++++
 arch/x86/kernel/machine_kexec_64.c   |   12 +-
 arch/x86/kernel/vmlinux.lds.S        |    7 +-
 arch/x86/xen/Kconfig                 |    1 +
 arch/x86/xen/Makefile                |    3 +
 arch/x86/xen/enlighten.c             |   11 +
 arch/x86/xen/kexec.c                 |  150 +++++++
 arch/x86/xen/machine_kexec_32.c      |  226 +++++++++++
 arch/x86/xen/machine_kexec_64.c      |  318 +++++++++++++++
 arch/x86/xen/relocate_kernel_32.S    |  323 +++++++++++++++
 arch/x86/xen/relocate_kernel_64.S    |  309 ++++++++++++++
 drivers/xen/sys-hypervisor.c         |   42 ++-
 include/linux/kexec.h                |   26 ++-
 include/xen/interface/xen.h          |   33 ++
 kernel/Makefile                      |    1 +
 kernel/kexec-firmware.c              |  743 ++++++++++++++++++++++++++++++++++
 kernel/kexec.c                       |   46 ++-
 20 files changed, 2331 insertions(+), 18 deletions(-)
Daniel Kiper (11):
      kexec: introduce kexec firmware support
      x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
      xen: Introduce architecture independent data for kexec/kdump
      x86/xen: Introduce architecture dependent data for kexec/kdump
      x86/xen: Register resources required by kexec-tools
      x86/xen: Add i386 kexec/kdump implementation
      x86/xen: Add x86_64 kexec/kdump implementation
      x86/xen: Add kexec/kdump Kconfig and makefile rules
      x86/xen/enlighten: Add init and crash kexec/kdump hooks
      drivers/xen: Export vmcoreinfo through sysfs
      x86: Add Xen kexec control code size check to linker script
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 01/11] kexec: introduce kexec firmware support
Some kexec/kdump implementations (e.g. Xen PVOPS) cannot use the default
Linux infrastructure and require some support from firmware and/or a hypervisor.
To cope with that problem, the kexec firmware infrastructure is introduced.
It allows a developer to use all kexec/kdump features of a given firmware
or hypervisor.
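The dispatch idea can be pictured with the following user-space sketch. Only the kexec_use_firmware flag comes from this patch; the enum and the two stub functions are hypothetical stand-ins for the generic and the firmware code paths, so this is an illustration rather than the actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the flag added by this patch; platform code would set it. */
static bool kexec_use_firmware = false;

/* Hypothetical markers telling the caller which path was taken. */
enum load_path { PATH_GENERIC = 1, PATH_FIRMWARE = 2 };

static int generic_load_stub(void)
{
	return PATH_GENERIC;
}

static int firmware_load_stub(void)
{
	/* The firmware path must only be reached when the flag is set. */
	assert(kexec_use_firmware);
	return PATH_FIRMWARE;
}

/* Pick the firmware path when the flag is set, else the default one. */
static int kexec_load_dispatch(void)
{
	if (kexec_use_firmware)
		return firmware_load_stub();

	return generic_load_stub();
}
```

Keeping the decision in one runtime flag lets a single kernel binary use the default path on baremetal and the firmware path under a hypervisor.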
v3 - suggestions/fixes:
   - replace kexec_ops struct by kexec firmware infrastructure
     (suggested by Eric Biederman).
v2 - suggestions/fixes:
   - add comment for kexec_ops.crash_alloc_temp_store member
     (suggested by Konrad Rzeszutek Wilk),
   - simplify kexec_ops usage
     (suggested by Konrad Rzeszutek Wilk).
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 include/linux/kexec.h   |   26 ++-
 kernel/Makefile         |    1 +
 kernel/kexec-firmware.c |  743 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/kexec.c          |   46 +++-
 4 files changed, 809 insertions(+), 7 deletions(-)
 create mode 100644 kernel/kexec-firmware.c
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d0b8458..9568457 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -116,17 +116,34 @@ struct kimage {
 #endif
 };
 
-
-
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
 extern int machine_kexec_prepare(struct kimage *image);
 extern void machine_kexec_cleanup(struct kimage *image);
+extern struct page *mf_kexec_kimage_alloc_pages(gfp_t gfp_mask,
+						unsigned int order,
+						unsigned long limit);
+extern void mf_kexec_kimage_free_pages(struct page *page);
+extern unsigned long mf_kexec_page_to_pfn(struct page *page);
+extern struct page *mf_kexec_pfn_to_page(unsigned long mfn);
+extern unsigned long mf_kexec_virt_to_phys(volatile void *address);
+extern void *mf_kexec_phys_to_virt(unsigned long address);
+extern int mf_kexec_prepare(struct kimage *image);
+extern int mf_kexec_load(struct kimage *image);
+extern void mf_kexec_cleanup(struct kimage *image);
+extern void mf_kexec_unload(struct kimage *image);
+extern void mf_kexec_shutdown(void);
+extern void mf_kexec(struct kimage *image);
 extern asmlinkage long sys_kexec_load(unsigned long entry,
 					unsigned long nr_segments,
 					struct kexec_segment __user *segments,
 					unsigned long flags);
+extern long firmware_sys_kexec_load(unsigned long entry,
+					unsigned long nr_segments,
+					struct kexec_segment __user *segments,
+					unsigned long flags);
 extern int kernel_kexec(void);
+extern int firmware_kernel_kexec(void);
 #ifdef CONFIG_COMPAT
 extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
 				unsigned long nr_segments,
@@ -135,7 +152,10 @@ extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
 #endif
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
+extern struct page *firmware_kimage_alloc_control_pages(struct kimage *image,
+							unsigned int order);
 extern void crash_kexec(struct pt_regs *);
+extern void firmware_crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
 void crash_save_vmcoreinfo(void);
@@ -168,6 +188,8 @@ unsigned long paddr_vmcoreinfo_note(void);
 #define VMCOREINFO_CONFIG(name) \
 	vmcoreinfo_append_str("CONFIG_%s=y\n", #name)
 
+extern bool kexec_use_firmware;
+
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 6c072b6..bc96b2f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_MODULE_SIG) += module_signing.o modsign_pubkey.o modsign_certificat
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
+obj-$(CONFIG_KEXEC_FIRMWARE) += kexec-firmware.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
diff --git a/kernel/kexec-firmware.c b/kernel/kexec-firmware.c
new file mode 100644
index 0000000..f6ddd4c
--- /dev/null
+++ b/kernel/kexec-firmware.c
@@ -0,0 +1,743 @@
+/*
+ * Copyright (C) 2002-2004 Eric Biederman  <ebiederm at xmission.com>
+ * Copyright (C) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * Most of the code here is a copy of kernel/kexec.c.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/kernel.h>
+#include <linux/kexec.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/reboot.h>
+#include <linux/slab.h>
+
+#include <asm/uaccess.h>
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address..., for
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+static int kimage_is_destination_range(struct kimage *image,
+				       unsigned long start, unsigned long end);
+static struct page *kimage_alloc_page(struct kimage *image,
+				       gfp_t gfp_mask,
+				       unsigned long dest);
+
+static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
+			   unsigned long nr_segments,
+			   struct kexec_segment __user *segments)
+{
+	size_t segment_bytes;
+	struct kimage *image;
+	unsigned long i;
+	int result;
+
+	/* Allocate a controlling structure */
+	result = -ENOMEM;
+	image = kzalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		goto out;
+
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	image->control_page = ~0; /* By default this does not apply */
+	image->start = entry;
+	image->type = KEXEC_TYPE_DEFAULT;
+
+	/* Initialize the list of control pages */
+	INIT_LIST_HEAD(&image->control_pages);
+
+	/* Initialize the list of destination pages */
+	INIT_LIST_HEAD(&image->dest_pages);
+
+	/* Initialize the list of unusable pages */
+	INIT_LIST_HEAD(&image->unuseable_pages);
+
+	/* Read in the segments */
+	image->nr_segments = nr_segments;
+	segment_bytes = nr_segments * sizeof(*segments);
+	result = copy_from_user(image->segment, segments, segment_bytes);
+	if (result) {
+		result = -EFAULT;
+		goto out;
+	}
+
+	/*
+	 * Verify we have good destination addresses.  The caller is
+	 * responsible for making certain we don't attempt to load
+	 * the new image into invalid or reserved areas of RAM.  This
+	 * just verifies it is an address we can use.
+	 *
+	 * Since the kernel does everything in page size chunks ensure
+	 * the destination addresses are page aligned.  Too many
+	 * special cases crop up when we don't do this.  The most
+	 * insidious is getting overlapping destination addresses
+	 * simply because addresses are changed to page size
+	 * granularity.
+	 */
+	result = -EADDRNOTAVAIL;
+	for (i = 0; i < nr_segments; i++) {
+		unsigned long mstart, mend;
+
+		mstart = image->segment[i].mem;
+		mend   = mstart + image->segment[i].memsz;
+		if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
+			goto out;
+		if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
+			goto out;
+	}
+
+	/* Verify our destination addresses do not overlap.
+	 * If we allowed overlapping destination addresses
+	 * through, very weird things can happen with no
+	 * easy explanation as one segment stomps on another.
+	 */
+	result = -EINVAL;
+	for (i = 0; i < nr_segments; i++) {
+		unsigned long mstart, mend;
+		unsigned long j;
+
+		mstart = image->segment[i].mem;
+		mend   = mstart + image->segment[i].memsz;
+		for (j = 0; j < i; j++) {
+			unsigned long pstart, pend;
+			pstart = image->segment[j].mem;
+			pend   = pstart + image->segment[j].memsz;
+			/* Do the segments overlap ? */
+			if ((mend > pstart) && (mstart < pend))
+				goto out;
+		}
+	}
+
+	/* Ensure our buffer sizes are no larger than
+	 * our memory sizes.  This should always be the case,
+	 * and it is easier to check up front than to be surprised
+	 * later on.
+	 */
+	result = -EINVAL;
+	for (i = 0; i < nr_segments; i++) {
+		if (image->segment[i].bufsz > image->segment[i].memsz)
+			goto out;
+	}
+
+	result = 0;
+out:
+	if (result == 0)
+		*rimage = image;
+	else
+		kfree(image);
+
+	return result;
+
+}
+
+static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
+				unsigned long nr_segments,
+				struct kexec_segment __user *segments)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = NULL;
+	result = do_kimage_alloc(&image, entry, nr_segments, segments);
+	if (result)
+		goto out;
+
+	*rimage = image;
+
+	/*
+	 * Find a location for the control code buffer, and add it to
+	 * the vector of segments so that its pages will also be
+	 * counted as destination pages.
+	 */
+	result = -ENOMEM;
+	image->control_code_page = firmware_kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		printk(KERN_ERR "Could not allocate control_code_buffer\n");
+		goto out;
+	}
+
+	image->swap_page = firmware_kimage_alloc_control_pages(image, 0);
+	if (!image->swap_page) {
+		printk(KERN_ERR "Could not allocate swap buffer\n");
+		goto out;
+	}
+
+	result = 0;
+ out:
+	if (result == 0)
+		*rimage = image;
+	else
+		kfree(image);
+
+	return result;
+}
+
+static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
+				unsigned long nr_segments,
+				struct kexec_segment __user *segments)
+{
+	int result;
+	struct kimage *image;
+	unsigned long i;
+
+	image = NULL;
+	/* Verify we have a valid entry point */
+	if ((entry < crashk_res.start) || (entry > crashk_res.end)) {
+		result = -EADDRNOTAVAIL;
+		goto out;
+	}
+
+	/* Allocate and initialize a controlling structure */
+	result = do_kimage_alloc(&image, entry, nr_segments, segments);
+	if (result)
+		goto out;
+
+	/* Enable the special crash kernel control page
+	 * allocation policy.
+	 */
+	image->control_page = crashk_res.start;
+	image->type = KEXEC_TYPE_CRASH;
+
+	/*
+	 * Verify we have good destination addresses.  Normally
+	 * the caller is responsible for making certain we don't
+	 * attempt to load the new image into invalid or reserved
+	 * areas of RAM.  But crash kernels are preloaded into a
+	 * reserved area of ram.  We must ensure the addresses
+	 * are in the reserved area otherwise preloading the
+	 * kernel could corrupt things.
+	 */
+	result = -EADDRNOTAVAIL;
+	for (i = 0; i < nr_segments; i++) {
+		unsigned long mstart, mend;
+
+		mstart = image->segment[i].mem;
+		mend = mstart + image->segment[i].memsz - 1;
+		/* Ensure we are within the crash kernel limits */
+		if ((mstart < crashk_res.start) || (mend > crashk_res.end))
+			goto out;
+	}
+
+	/*
+	 * Find a location for the control code buffer, and add it to
+	 * the vector of segments so that its pages will also be
+	 * counted as destination pages.
+	 */
+	result = -ENOMEM;
+	image->control_code_page = firmware_kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		printk(KERN_ERR "Could not allocate control_code_buffer\n");
+		goto out;
+	}
+
+	result = 0;
+out:
+	if (result == 0)
+		*rimage = image;
+	else
+		kfree(image);
+
+	return result;
+}
+
+static int kimage_is_destination_range(struct kimage *image,
+					unsigned long start,
+					unsigned long end)
+{
+	unsigned long i;
+
+	for (i = 0; i < image->nr_segments; i++) {
+		unsigned long mstart, mend;
+
+		mstart = image->segment[i].mem;
+		mend = mstart + image->segment[i].memsz;
+		if ((end > mstart) && (start < mend))
+			return 1;
+	}
+
+	return 0;
+}
+
+static void kimage_free_page_list(struct list_head *list)
+{
+	struct list_head *pos, *next;
+
+	list_for_each_safe(pos, next, list) {
+		struct page *page;
+
+		page = list_entry(pos, struct page, lru);
+		list_del(&page->lru);
+		mf_kexec_kimage_free_pages(page);
+	}
+}
+
+static struct page *kimage_alloc_normal_control_pages(struct kimage *image,
+							unsigned int order)
+{
+	/* Control pages are special, they are the intermediaries
+	 * that are needed while we copy the rest of the pages
+	 * to their final resting place.  As such they must
+	 * not conflict with either the destination addresses
+	 * or memory the kernel is already using.
+	 *
+	 * The only case where we really need more than one of
+	 * these are for architectures where we cannot disable
+	 * the MMU and must instead generate an identity mapped
+	 * page table for all of the memory.
+	 *
+	 * At worst this runs in O(N) of the image size.
+	 */
+	struct list_head extra_pages;
+	struct page *pages;
+	unsigned int count;
+
+	count = 1 << order;
+	INIT_LIST_HEAD(&extra_pages);
+
+	/* Loop while I can allocate a page and the page allocated
+	 * is a destination page.
+	 */
+	do {
+		unsigned long pfn, epfn, addr, eaddr;
+
+		pages = mf_kexec_kimage_alloc_pages(GFP_KERNEL, order,
+							KEXEC_CONTROL_MEMORY_LIMIT);
+		if (!pages)
+			break;
+		pfn   = mf_kexec_page_to_pfn(pages);
+		epfn  = pfn + count;
+		addr  = pfn << PAGE_SHIFT;
+		eaddr = epfn << PAGE_SHIFT;
+		if ((epfn >= (KEXEC_CONTROL_MEMORY_LIMIT >> PAGE_SHIFT)) ||
+			      kimage_is_destination_range(image, addr, eaddr)) {
+			list_add(&pages->lru, &extra_pages);
+			pages = NULL;
+		}
+	} while (!pages);
+
+	if (pages) {
+		/* Remember the allocated page... */
+		list_add(&pages->lru, &image->control_pages);
+
+		/* Because the page is already in it's destination
+		 * location we will never allocate another page at
+		 * that address.  Therefore mf_kexec_kimage_alloc_pages
+		 * will not return it (again) and we don't need
+		 * to give it an entry in image->segment[].
+		 */
+	}
+	/* Deal with the destination pages I have inadvertently allocated.
+	 *
+	 * Ideally I would convert multi-page allocations into single
+	 * page allocations, and add everything to image->dest_pages.
+	 *
+	 * For now it is simpler to just free the pages.
+	 */
+	kimage_free_page_list(&extra_pages);
+
+	return pages;
+}
+
+struct page *firmware_kimage_alloc_control_pages(struct kimage *image,
+							unsigned int order)
+{
+	return kimage_alloc_normal_control_pages(image, order);
+}
+
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+	if (*image->entry != 0)
+		image->entry++;
+
+	if (image->entry == image->last_entry) {
+		kimage_entry_t *ind_page;
+		struct page *page;
+
+		page = kimage_alloc_page(image, GFP_KERNEL, KIMAGE_NO_DEST);
+		if (!page)
+			return -ENOMEM;
+
+		ind_page = page_address(page);
+		*image->entry = mf_kexec_virt_to_phys(ind_page) | IND_INDIRECTION;
+		image->entry = ind_page;
+		image->last_entry = ind_page +
+				      ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+	}
+	*image->entry = entry;
+	image->entry++;
+	*image->entry = 0;
+
+	return 0;
+}
+
+static int kimage_set_destination(struct kimage *image,
+				   unsigned long destination)
+{
+	int result;
+
+	destination &= PAGE_MASK;
+	result = kimage_add_entry(image, destination | IND_DESTINATION);
+	if (result == 0)
+		image->destination = destination;
+
+	return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+	int result;
+
+	page &= PAGE_MASK;
+	result = kimage_add_entry(image, page | IND_SOURCE);
+	if (result == 0)
+		image->destination += PAGE_SIZE;
+
+	return result;
+}
+
+
+static void kimage_free_extra_pages(struct kimage *image)
+{
+	/* Walk through and free any extra destination pages I may have */
+	kimage_free_page_list(&image->dest_pages);
+
+	/* Walk through and free any unusable pages I have cached */
+	kimage_free_page_list(&image->unuseable_pages);
+
+}
+static void kimage_terminate(struct kimage *image)
+{
+	if (*image->entry != 0)
+		image->entry++;
+
+	*image->entry = IND_DONE;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+		ptr = (entry & IND_INDIRECTION)? \
+			mf_kexec_phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free_entry(kimage_entry_t entry)
+{
+	struct page *page;
+
+	page = mf_kexec_pfn_to_page(entry >> PAGE_SHIFT);
+	mf_kexec_kimage_free_pages(page);
+}
+
+static void kimage_free(struct kimage *image)
+{
+	kimage_entry_t *ptr, entry;
+	kimage_entry_t ind = 0;
+
+	if (!image)
+		return;
+
+	kimage_free_extra_pages(image);
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_INDIRECTION) {
+			/* Free the previous indirection page */
+			if (ind & IND_INDIRECTION)
+				kimage_free_entry(ind);
+			/* Save this indirection page until we are
+			 * done with it.
+			 */
+			ind = entry;
+		}
+		else if (entry & IND_SOURCE)
+			kimage_free_entry(entry);
+	}
+	/* Free the final indirection page */
+	if (ind & IND_INDIRECTION)
+		kimage_free_entry(ind);
+
+	/* Handle any machine specific cleanup */
+	mf_kexec_cleanup(image);
+
+	/* Free the kexec control pages... */
+	kimage_free_page_list(&image->control_pages);
+	kfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kimage *image,
+					unsigned long page)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination = 0;
+
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_DESTINATION)
+			destination = entry & PAGE_MASK;
+		else if (entry & IND_SOURCE) {
+			if (page == destination)
+				return ptr;
+			destination += PAGE_SIZE;
+		}
+	}
+
+	return NULL;
+}
+
+static struct page *kimage_alloc_page(struct kimage *image,
+					gfp_t gfp_mask,
+					unsigned long destination)
+{
+	/*
+	 * Here we implement safeguards to ensure that a source page
+	 * is not copied to its destination page before the data on
+	 * the destination page is no longer useful.
+	 *
+	 * To do this we maintain the invariant that a source page is
+	 * either its own destination page, or it is not a
+	 * destination page at all.
+	 *
+	 * That is slightly stronger than required, but the proof
+	 * that no problems will occur is trivial, and the
+	 * implementation is simple to verify.
+	 *
+	 * When allocating all pages normally this algorithm will run
+	 * in O(N) time, but in the worst case it will run in O(N^2)
+	 * time.   If the runtime is a problem the data structures can
+	 * be fixed.
+	 */
+	struct page *page;
+	unsigned long addr;
+
+	/*
+	 * Walk through the list of destination pages, and see if I
+	 * have a match.
+	 */
+	list_for_each_entry(page, &image->dest_pages, lru) {
+		addr = mf_kexec_page_to_pfn(page) << PAGE_SHIFT;
+		if (addr == destination) {
+			list_del(&page->lru);
+			return page;
+		}
+	}
+	page = NULL;
+	while (1) {
+		kimage_entry_t *old;
+
+		/* Allocate a page, if we run out of memory give up */
+		page = mf_kexec_kimage_alloc_pages(gfp_mask, 0,
+							KEXEC_SOURCE_MEMORY_LIMIT);
+		if (!page)
+			return NULL;
+		/* If the page cannot be used file it away */
+		if (mf_kexec_page_to_pfn(page) >
+				(KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
+			list_add(&page->lru, &image->unuseable_pages);
+			continue;
+		}
+		addr = mf_kexec_page_to_pfn(page) << PAGE_SHIFT;
+
+		/* If it is the destination page we want use it */
+		if (addr == destination)
+			break;
+
+		/* If the page is not a destination page use it */
+		if (!kimage_is_destination_range(image, addr,
+						  addr + PAGE_SIZE))
+			break;
+
+		/*
+		 * I know that the page is someones destination page.
+		 * See if there is already a source page for this
+		 * destination page.  And if so swap the source pages.
+		 */
+		old = kimage_dst_used(image, addr);
+		if (old) {
+			/* If so move it */
+			unsigned long old_addr;
+			struct page *old_page;
+
+			old_addr = *old & PAGE_MASK;
+			old_page = mf_kexec_pfn_to_page(old_addr >> PAGE_SHIFT);
+			copy_highpage(page, old_page);
+			*old = addr | (*old & ~PAGE_MASK);
+
+			/* The old page I have found cannot be a
+			 * destination page, so return it if it's
+			 * gfp_flags honor the ones passed in.
+			 */
+			if (!(gfp_mask & __GFP_HIGHMEM) &&
+			    PageHighMem(old_page)) {
+				mf_kexec_kimage_free_pages(old_page);
+				continue;
+			}
+			addr = old_addr;
+			page = old_page;
+			break;
+		}
+		else {
+			/* Place the page on the destination list I
+			 * will use it later.
+			 */
+			list_add(&page->lru, &image->dest_pages);
+		}
+	}
+
+	return page;
+}
+
+static int kimage_load_normal_segment(struct kimage *image,
+					 struct kexec_segment *segment)
+{
+	unsigned long maddr;
+	unsigned long ubytes, mbytes;
+	int result;
+	unsigned char __user *buf;
+
+	result = 0;
+	buf = segment->buf;
+	ubytes = segment->bufsz;
+	mbytes = segment->memsz;
+	maddr = segment->mem;
+
+	result = kimage_set_destination(image, maddr);
+	if (result < 0)
+		goto out;
+
+	while (mbytes) {
+		struct page *page;
+		char *ptr;
+		size_t uchunk, mchunk;
+
+		page = kimage_alloc_page(image, GFP_HIGHUSER, maddr);
+		if (!page) {
+			result  = -ENOMEM;
+			goto out;
+		}
+		result = kimage_add_page(image, mf_kexec_page_to_pfn(page)
+								<< PAGE_SHIFT);
+		if (result < 0)
+			goto out;
+
+		ptr = kmap(page);
+		/* Start with a clear page */
+		clear_page(ptr);
+		ptr += maddr & ~PAGE_MASK;
+		mchunk = PAGE_SIZE - (maddr & ~PAGE_MASK);
+		if (mchunk > mbytes)
+			mchunk = mbytes;
+
+		uchunk = mchunk;
+		if (uchunk > ubytes)
+			uchunk = ubytes;
+
+		result = copy_from_user(ptr, buf, uchunk);
+		kunmap(page);
+		if (result) {
+			result = -EFAULT;
+			goto out;
+		}
+		ubytes -= uchunk;
+		maddr  += mchunk;
+		buf    += mchunk;
+		mbytes -= mchunk;
+	}
+out:
+	return result;
+}
+
+static int kimage_load_segment(struct kimage *image,
+				struct kexec_segment *segment)
+{
+	return kimage_load_normal_segment(image, segment);
+}
+
+long firmware_sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+				struct kexec_segment __user *segments,
+				unsigned long flags)
+{
+	struct kimage **dest_image, *image = NULL;
+	int result = 0;
+
+	dest_image = &kexec_image;
+	if (flags & KEXEC_ON_CRASH)
+		dest_image = &kexec_crash_image;
+	if (nr_segments > 0) {
+		unsigned long i;
+
+		/* Loading another kernel to reboot into */
+		if ((flags & KEXEC_ON_CRASH) == 0)
+			result = kimage_normal_alloc(&image, entry,
+							nr_segments, segments);
+		/* Loading another kernel to switch to if this one crashes */
+		else if (flags & KEXEC_ON_CRASH) {
+			/* Free any current crash dump kernel before
+			 * we corrupt it.
+			 */
+			mf_kexec_unload(image);
+			kimage_free(xchg(&kexec_crash_image, NULL));
+			result = kimage_crash_alloc(&image, entry,
+						     nr_segments, segments);
+		}
+		if (result)
+			goto out;
+
+		if (flags & KEXEC_PRESERVE_CONTEXT)
+			image->preserve_context = 1;
+		result = mf_kexec_prepare(image);
+		if (result)
+			goto out;
+
+		for (i = 0; i < nr_segments; i++) {
+			result = kimage_load_segment(image, &image->segment[i]);
+			if (result)
+				goto out;
+		}
+		kimage_terminate(image);
+	}
+
+	result = mf_kexec_load(image);
+
+	if (result)
+		goto out;
+
+	/* Install the new kernel, and uninstall the old */
+	image = xchg(dest_image, image);
+
+out:
+	mf_kexec_unload(image);
+
+	kimage_free(image);
+
+	return result;
+}
+
+void firmware_crash_kexec(struct pt_regs *regs)
+{
+	struct pt_regs fixed_regs;
+
+	crash_setup_regs(&fixed_regs, regs);
+	crash_save_vmcoreinfo();
+	machine_crash_shutdown(&fixed_regs);
+	mf_kexec(kexec_crash_image);
+}
+
+int firmware_kernel_kexec(void)
+{
+	kernel_restart_prepare(NULL);
+	printk(KERN_EMERG "Starting new kernel\n");
+	mf_kexec_shutdown();
+	mf_kexec(kexec_image);
+
+	return 0;
+}
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 5e4bd78..9f3b6cb 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -38,6 +38,10 @@
 #include <asm/io.h>
 #include <asm/sections.h>
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+bool kexec_use_firmware = false;
+#endif
+
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
 
@@ -924,7 +928,7 @@ static int kimage_load_segment(struct kimage *image,
  *   the devices in a consistent state so a later kernel can
  *   reinitialize them.
  *
- * - A machine specific part that includes the syscall number
+ * - A machine/firmware specific part that includes the syscall number
  *   and the copies the image to it's final destination.  And
  *   jumps into the image at entry.
  *
@@ -978,6 +982,17 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
 	if (!mutex_trylock(&kexec_mutex))
 		return -EBUSY;
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+	if (kexec_use_firmware) {
+		result = firmware_sys_kexec_load(entry, nr_segments,
+							segments, flags);
+
+		mutex_unlock(&kexec_mutex);
+
+		return result;
+	}
+#endif
+
 	dest_image = &kexec_image;
 	if (flags & KEXEC_ON_CRASH)
 		dest_image = &kexec_crash_image;
@@ -1091,10 +1106,17 @@ void crash_kexec(struct pt_regs *regs)
 		if (kexec_crash_image) {
 			struct pt_regs fixed_regs;
 
-			crash_setup_regs(&fixed_regs, regs);
-			crash_save_vmcoreinfo();
-			machine_crash_shutdown(&fixed_regs);
-			machine_kexec(kexec_crash_image);
+#ifdef CONFIG_KEXEC_FIRMWARE
+			if (kexec_use_firmware)
+				firmware_crash_kexec(regs);
+			else
+#endif
+			{
+				crash_setup_regs(&fixed_regs, regs);
+				crash_save_vmcoreinfo();
+				machine_crash_shutdown(&fixed_regs);
+				machine_kexec(kexec_crash_image);
+			}
 		}
 		mutex_unlock(&kexec_mutex);
 	}
@@ -1132,6 +1154,13 @@ int crash_shrink_memory(unsigned long new_size)
 
 	mutex_lock(&kexec_mutex);
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+	if (kexec_use_firmware) {
+		ret = -ENOSYS;
+		goto unlock;
+	}
+#endif
+
 	if (kexec_crash_image) {
 		ret = -ENOENT;
 		goto unlock;
@@ -1536,6 +1565,13 @@ int kernel_kexec(void)
 		goto Unlock;
 	}
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+	if (kexec_use_firmware) {
+		error = firmware_kernel_kexec();
+		goto Unlock;
+	}
+#endif
+
 #ifdef CONFIG_KEXEC_JUMP
 	if (kexec_image->preserve_context) {
 		lock_system_sleep();
-- 
1.5.6.5
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
Some implementations (e.g. Xen PVOPS) cannot use part of the identity page table
to construct the transition page table. This means that they require separate
PUDs, PMDs and PTEs for the virtual and the physical (identity) mapping. To
satisfy that requirement, add extra pointers to PGD, PUD, PMD and PTE and adjust
the existing code.
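As a rough illustration of why two sets of pointers are useful, the user-space sketch below models the extended structure and releases both sets. The struct name kimage_arch_sketch, both helpers and the use of calloc()/free() instead of get_zeroed_page()/free_page() are assumptions made for this example; which index serves which mapping is left to the implementation.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

/* User-space model of the extended kimage_arch: one table set per mapping. */
struct kimage_arch_sketch {
	void *pgd;
	void *pud0, *pud1;
	void *pmd0, *pmd1;
	void *pte0, *pte1;
};

/*
 * Allocate one zeroed "page" per table. The struct must be zeroed by the
 * caller; on failure (-1) the caller releases what was already allocated.
 */
static int alloc_transition_tables(struct kimage_arch_sketch *arch)
{
	void **slots[] = {
		&arch->pgd,
		&arch->pud0, &arch->pud1,
		&arch->pmd0, &arch->pmd1,
		&arch->pte0, &arch->pte1,
	};
	size_t i, n = sizeof(slots) / sizeof(slots[0]);

	assert(arch != NULL);
	for (i = 0; i < n; i++) {
		*slots[i] = calloc(1, SKETCH_PAGE_SIZE);
		if (!*slots[i])
			return -1;
	}

	return 0;
}

/* Release both sets; free(NULL) is a no-op, so partial fills are fine. */
static void free_transition_tables(struct kimage_arch_sketch *arch)
{
	free(arch->pte1);
	free(arch->pte0);
	free(arch->pmd1);
	free(arch->pmd0);
	free(arch->pud1);
	free(arch->pud0);
	free(arch->pgd);
	*arch = (struct kimage_arch_sketch){ 0 };
}
```

Tracking every table separately means the cleanup path does not have to walk the page table to find what was allocated.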
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 arch/x86/include/asm/kexec.h       |   10 +++++++---
 arch/x86/kernel/machine_kexec_64.c |   12 ++++++------
 2 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 6080d26..cedd204 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -157,9 +157,13 @@ struct kimage_arch {
 };
 #else
 struct kimage_arch {
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
+	pgd_t *pgd;
+	pud_t *pud0;
+	pud_t *pud1;
+	pmd_t *pmd0;
+	pmd_t *pmd1;
+	pte_t *pte0;
+	pte_t *pte1;
 };
 #endif
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index b3ea9db..976e54b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -137,9 +137,9 @@ out:
 
 static void free_transition_pgtable(struct kimage *image)
 {
-	free_page((unsigned long)image->arch.pud);
-	free_page((unsigned long)image->arch.pmd);
-	free_page((unsigned long)image->arch.pte);
+	free_page((unsigned long)image->arch.pud0);
+	free_page((unsigned long)image->arch.pmd0);
+	free_page((unsigned long)image->arch.pte0);
 }
 
 static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
@@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pud)
 			goto err;
-		image->arch.pud = pud;
+		image->arch.pud0 = pud;
 		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
 	}
 	pud = pud_offset(pgd, vaddr);
@@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pmd)
 			goto err;
-		image->arch.pmd = pmd;
+		image->arch.pmd0 = pmd;
 		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
 	}
 	pmd = pmd_offset(pud, vaddr);
@@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pte)
 			goto err;
-		image->arch.pte = pte;
+		image->arch.pte0 = pte;
 		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-- 
1.5.6.5
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 03/11] xen: Introduce architecture independent data for kexec/kdump
Introduce architecture independent constants and structures
required by Xen kexec/kdump implementation.
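A short user-space sketch of how these definitions might be used to query a memory range follows. The constants and struct xen_kexec_range are copied from this patch; mock_kexec_op() is a hypothetical stand-in for the real hypercall, the addresses it returns are made up, and the assumption that the hypervisor fills in start and size is mine.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from this patch. */
#define KEXEC_CMD_kexec_get_range	3
#define KEXEC_RANGE_MA_CRASH		0
#define KEXEC_RANGE_MA_VMCOREINFO	6

struct xen_kexec_range {
	int range;
	int nr;
	unsigned long size;
	unsigned long start;
};

/* Hypothetical stand-in for the kexec hypercall; addresses are made up. */
static int mock_kexec_op(unsigned long op, void *args)
{
	struct xen_kexec_range *r = args;

	if (op != KEXEC_CMD_kexec_get_range || r == NULL)
		return -1;

	switch (r->range) {
	case KEXEC_RANGE_MA_CRASH:
		r->start = 0x4000000UL;
		r->size  = 0x2000000UL;
		return 0;
	case KEXEC_RANGE_MA_VMCOREINFO:
		r->start = 0x7f00000UL;
		r->size  = 0x1000UL;
		return 0;
	default:
		return -1;
	}
}

/* Query one range; returns 0 and fills *start and *size on success. */
static int query_kexec_range(int range, unsigned long *start,
			     unsigned long *size)
{
	struct xen_kexec_range r = { .range = range };
	int rc;

	assert(start != NULL && size != NULL);
	rc = mock_kexec_op(KEXEC_CMD_kexec_get_range, &r);
	if (rc)
		return rc;

	*start = r.start;
	*size = r.size;

	return 0;
}
```

A caller would typically issue one such query per KEXEC_RANGE_MA_* value it cares about.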
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 include/xen/interface/xen.h |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 886a5d8..09c16ab 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -57,6 +57,7 @@
 #define __HYPERVISOR_event_channel_op     32
 #define __HYPERVISOR_physdev_op           33
 #define __HYPERVISOR_hvm_op               34
+#define __HYPERVISOR_kexec_op             37
 #define __HYPERVISOR_tmem_op              38
 
 /* Architecture-specific hypercall definitions. */
@@ -231,7 +232,39 @@ DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
 #define VMASST_TYPE_pae_extended_cr3     3
 #define MAX_VMASST_TYPE 3
 
+/*
+ * Commands to HYPERVISOR_kexec_op().
+ */
+#define KEXEC_CMD_kexec			0
+#define KEXEC_CMD_kexec_load		1
+#define KEXEC_CMD_kexec_unload		2
+#define KEXEC_CMD_kexec_get_range	3
+
+/*
+ * Memory ranges for kdump (utilized by HYPERVISOR_kexec_op()).
+ */
+#define KEXEC_RANGE_MA_CRASH		0
+#define KEXEC_RANGE_MA_XEN		1
+#define KEXEC_RANGE_MA_CPU		2
+#define KEXEC_RANGE_MA_XENHEAP		3
+#define KEXEC_RANGE_MA_BOOT_PARAM	4
+#define KEXEC_RANGE_MA_EFI_MEMMAP	5
+#define KEXEC_RANGE_MA_VMCOREINFO	6
+
 #ifndef __ASSEMBLY__
+struct xen_kexec_exec {
+	int type;
+};
+
+struct xen_kexec_range {
+	int range;
+	int nr;
+	unsigned long size;
+	unsigned long start;
+};
+
+extern unsigned long xen_vmcoreinfo_maddr;
+extern unsigned long xen_vmcoreinfo_max_size;
 
 typedef uint16_t domid_t;
 
-- 
1.5.6.5
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 04/11] x86/xen: Introduce architecture dependent data for kexec/kdump
Introduce architecture dependent constants, structures and
functions required by Xen kexec/kdump implementation.
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 arch/x86/include/asm/xen/hypercall.h |    6 +++
 arch/x86/include/asm/xen/kexec.h     |   79 ++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/xen/kexec.h
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index c20d1ce..e76a1b8 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -459,6 +459,12 @@ HYPERVISOR_hvm_op(int op, void *arg)
 }
 
 static inline int
+HYPERVISOR_kexec_op(unsigned long op, void *args)
+{
+	return _hypercall2(int, kexec_op, op, args);
+}
+
+static inline int
 HYPERVISOR_tmem_op(
 	struct tmem_op *op)
 {
diff --git a/arch/x86/include/asm/xen/kexec.h b/arch/x86/include/asm/xen/kexec.h
new file mode 100644
index 0000000..d09b52f
--- /dev/null
+++ b/arch/x86/include/asm/xen/kexec.h
@@ -0,0 +1,79 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _ASM_X86_XEN_KEXEC_H
+#define _ASM_X86_XEN_KEXEC_H
+
+#define KEXEC_XEN_NO_PAGES	17
+
+#define XK_MA_CONTROL_PAGE	0
+#define XK_VA_CONTROL_PAGE	1
+#define XK_MA_PGD_PAGE		2
+#define XK_VA_PGD_PAGE		3
+#define XK_MA_PUD0_PAGE		4
+#define XK_VA_PUD0_PAGE		5
+#define XK_MA_PUD1_PAGE		6
+#define XK_VA_PUD1_PAGE		7
+#define XK_MA_PMD0_PAGE		8
+#define XK_VA_PMD0_PAGE		9
+#define XK_MA_PMD1_PAGE		10
+#define XK_VA_PMD1_PAGE		11
+#define XK_MA_PTE0_PAGE		12
+#define XK_VA_PTE0_PAGE		13
+#define XK_MA_PTE1_PAGE		14
+#define XK_VA_PTE1_PAGE		15
+#define XK_MA_TABLE_PAGE	16
+
+#ifndef __ASSEMBLY__
+struct xen_kexec_image {
+	unsigned long page_list[KEXEC_XEN_NO_PAGES];
+	unsigned long indirection_page;
+	unsigned long start_address;
+};
+
+struct xen_kexec_load {
+	int type;
+	struct xen_kexec_image image;
+};
+
+extern unsigned int xen_kexec_control_code_size;
+
+#ifdef CONFIG_X86_32
+extern void xen_relocate_kernel(unsigned long indirection_page,
+				unsigned long *page_list,
+				unsigned long start_address,
+				unsigned int has_pae,
+				unsigned int preserve_context);
+#else
+extern void xen_relocate_kernel(unsigned long indirection_page,
+				unsigned long *page_list,
+				unsigned long start_address,
+				unsigned int preserve_context);
+#endif
+#endif
+#endif /* _ASM_X86_XEN_KEXEC_H */
-- 
1.5.6.5
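The XK_* indices in the header above pair each page's machine address (even slot) with its virtual address (odd slot), and leave one unpaired table page in the last slot. A minimal user-space sketch of that layout follows. The helper names (xk_ma_slot, xk_va_slot, xk_set_pair) and XK_NR_PAIRS are hypothetical and are not part of the patch; only KEXEC_XEN_NO_PAGES and the slot ordering are taken from it.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors KEXEC_XEN_NO_PAGES from the patch: 8 MA/VA pairs plus 1 table page. */
#define KEXEC_XEN_NO_PAGES	17
/* Hypothetical: number of MA/VA pairs that precede the table page slot. */
#define XK_NR_PAIRS		((KEXEC_XEN_NO_PAGES - 1) / 2)

/* Hypothetical helpers: pair n uses slot 2n (machine) and slot 2n+1 (virtual). */
static size_t xk_ma_slot(size_t pair) { return 2 * pair; }
static size_t xk_va_slot(size_t pair) { return 2 * pair + 1; }

/* Store one MA/VA pair; returns 0 on success, -1 if the pair is out of range. */
static int xk_set_pair(unsigned long *page_list, size_t pair,
		       unsigned long maddr, unsigned long vaddr)
{
	assert(page_list != NULL);

	if (pair >= XK_NR_PAIRS)
		return -1;

	page_list[xk_ma_slot(pair)] = maddr;
	page_list[xk_va_slot(pair)] = vaddr;
	return 0;
}
```

With this numbering, pair 2 lands in slots 4 and 5, which correspond to XK_MA_PUD0_PAGE and XK_VA_PUD0_PAGE in the header.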
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 05/11] x86/xen: Register resources required by kexec-tools
Register resources required by kexec-tools.
v2 - suggestions/fixes:
   - change logging level
     (suggested by Konrad Rzeszutek Wilk).
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 arch/x86/xen/kexec.c |  150 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 150 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/kexec.c
diff --git a/arch/x86/xen/kexec.c b/arch/x86/xen/kexec.c
new file mode 100644
index 0000000..7ec4c45
--- /dev/null
+++ b/arch/x86/xen/kexec.c
@@ -0,0 +1,150 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kernel.h>
+#include <linux/kexec.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include <xen/interface/platform.h>
+#include <xen/interface/xen.h>
+#include <xen/xen.h>
+
+#include <asm/xen/hypercall.h>
+
+unsigned long xen_vmcoreinfo_maddr = 0;
+unsigned long xen_vmcoreinfo_max_size = 0;
+
+static int __init xen_init_kexec_resources(void)
+{
+	int rc;
+	static struct resource xen_hypervisor_res = {
+		.name = "Hypervisor code and data",
+		.flags = IORESOURCE_BUSY | IORESOURCE_MEM
+	};
+	struct resource *cpu_res;
+	struct xen_kexec_range xkr;
+	struct xen_platform_op cpuinfo_op;
+	uint32_t cpus, i;
+
+	if (!xen_initial_domain())
+		return 0;
+
+	if (strstr(boot_command_line, "crashkernel="))
+		pr_warn("kexec: Ignoring crashkernel option. "
+			"It should be passed to Xen hypervisor.\n");
+
+	/* Register Crash kernel resource. */
+	xkr.range = KEXEC_RANGE_MA_CRASH;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_CRASH)"
+			": %i\n", __func__, rc);
+		return rc;
+	}
+
+	if (!xkr.size)
+		return 0;
+
+	crashk_res.start = xkr.start;
+	crashk_res.end = xkr.start + xkr.size - 1;
+	insert_resource(&iomem_resource, &crashk_res);
+
+	/* Register Hypervisor code and data resource. */
+	xkr.range = KEXEC_RANGE_MA_XEN;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_XEN)"
+			": %i\n", __func__, rc);
+		return rc;
+	}
+
+	xen_hypervisor_res.start = xkr.start;
+	xen_hypervisor_res.end = xkr.start + xkr.size - 1;
+	insert_resource(&iomem_resource, &xen_hypervisor_res);
+
+	/* Determine maximum number of physical CPUs. */
+	cpuinfo_op.cmd = XENPF_get_cpuinfo;
+	cpuinfo_op.u.pcpu_info.xen_cpuid = 0;
+	rc = HYPERVISOR_dom0_op(&cpuinfo_op);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_dom0_op(): %i\n", __func__, rc);
+		return rc;
+	}
+
+	cpus = cpuinfo_op.u.pcpu_info.max_present + 1;
+
+	/* Register CPUs Crash note resources. */
+	cpu_res = kcalloc(cpus, sizeof(struct resource), GFP_KERNEL);
+
+	if (!cpu_res) {
+		pr_warn("kexec: %s: kcalloc(): %i\n", __func__, -ENOMEM);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < cpus; ++i) {
+		xkr.range = KEXEC_RANGE_MA_CPU;
+		xkr.nr = i;
+		rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+		if (rc) {
+			pr_warn("kexec: %s: cpu: %u: HYPERVISOR_kexec_op"
+				"(KEXEC_RANGE_MA_CPU): %i\n", __func__, i, rc);
+			continue;
+		}
+
+		cpu_res->name = "Crash note";
+		cpu_res->start = xkr.start;
+		cpu_res->end = xkr.start + xkr.size - 1;
+		cpu_res->flags = IORESOURCE_BUSY | IORESOURCE_MEM;
+		insert_resource(&iomem_resource, cpu_res++);
+	}
+
+	/* Get vmcoreinfo address and maximum allowed size. */
+	xkr.range = KEXEC_RANGE_MA_VMCOREINFO;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_VMCOREINFO)"
+			": %i\n", __func__, rc);
+		return rc;
+	}
+
+	xen_vmcoreinfo_maddr = xkr.start;
+	xen_vmcoreinfo_max_size = xkr.size;
+
+	return 0;
+}
+
+core_initcall(xen_init_kexec_resources);
-- 
1.5.6.5
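The patch above repeatedly turns a (start, size) pair reported by the hypervisor into the inclusive [start, end] bounds that struct resource expects, using end = start + size - 1. A minimal user-space sketch of that conversion follows. struct span and range_to_span() are hypothetical stand-ins, not kernel or Xen APIs, and rejecting an empty range is an assumption modelled on the early return for the crash range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct resource (inclusive bounds). */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Convert a (start, size) pair, as reported in struct xen_kexec_range,
 * into inclusive bounds. Returns 0 on success, -1 for an empty range.
 */
static int range_to_span(uint64_t start, uint64_t size, struct span *out)
{
	assert(out != NULL);

	if (size == 0)
		return -1;

	out->start = start;
	out->end = start + size - 1;	/* last byte that belongs to the range */
	return 0;
}
```

Checking for a zero size before subtracting avoids producing an end address that lies below the start.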
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 09/11] x86/xen/enlighten: Add init and crash kexec/kdump hooks
Add init and crash kexec/kdump hooks.
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 arch/x86/xen/enlighten.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 138e566..5025bba 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -31,6 +31,7 @@
 #include <linux/pci.h>
 #include <linux/gfp.h>
 #include <linux/memblock.h>
+#include <linux/kexec.h>
 
 #include <xen/xen.h>
 #include <xen/events.h>
@@ -1276,6 +1277,12 @@ static void xen_machine_power_off(void)
 
 static void xen_crash_shutdown(struct pt_regs *regs)
 {
+#ifdef CONFIG_KEXEC_FIRMWARE
+	if (kexec_crash_image) {
+		crash_save_cpu(regs, safe_smp_processor_id());
+		return;
+	}
+#endif
 	xen_reboot(SHUTDOWN_crash);
 }
 
@@ -1353,6 +1360,10 @@ asmlinkage void __init xen_start_kernel(void)
 
 	xen_init_mmu_ops();
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+	kexec_use_firmware = true;
+#endif
+
 	/* Prevent unwanted bits from being set in PTEs. */
 	__supported_pte_mask &= ~_PAGE_GLOBAL;
 #if 0
-- 
1.5.6.5
Daniel Kiper
2012-Dec-27  02:18 UTC
[PATCH v3 10/11] drivers/xen: Export vmcoreinfo through sysfs
Export vmcoreinfo through sysfs.
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 drivers/xen/sys-hypervisor.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)
diff --git a/drivers/xen/sys-hypervisor.c b/drivers/xen/sys-hypervisor.c
index 96453f8..9dd290c 100644
--- a/drivers/xen/sys-hypervisor.c
+++ b/drivers/xen/sys-hypervisor.c
@@ -368,6 +368,41 @@ static void xen_properties_destroy(void)
 	sysfs_remove_group(hypervisor_kobj, &xen_properties_group);
 }
 
+#ifdef CONFIG_KEXEC_FIRMWARE
+static ssize_t vmcoreinfo_show(struct hyp_sysfs_attr *attr, char *buffer)
+{
+	return sprintf(buffer, "%lx %lx\n", xen_vmcoreinfo_maddr,
+						xen_vmcoreinfo_max_size);
+}
+
+HYPERVISOR_ATTR_RO(vmcoreinfo);
+
+static int __init xen_vmcoreinfo_init(void)
+{
+	if (!xen_vmcoreinfo_max_size)
+		return 0;
+
+	return sysfs_create_file(hypervisor_kobj, &vmcoreinfo_attr.attr);
+}
+
+static void xen_vmcoreinfo_destroy(void)
+{
+	if (!xen_vmcoreinfo_max_size)
+		return;
+
+	sysfs_remove_file(hypervisor_kobj, &vmcoreinfo_attr.attr);
+}
+#else
+static int __init xen_vmcoreinfo_init(void)
+{
+	return 0;
+}
+
+static void xen_vmcoreinfo_destroy(void)
+{
+}
+#endif
+
 static int __init hyper_sysfs_init(void)
 {
 	int ret;
@@ -390,9 +425,14 @@ static int __init hyper_sysfs_init(void)
 	ret = xen_properties_init();
 	if (ret)
 		goto prop_out;
+	ret = xen_vmcoreinfo_init();
+	if (ret)
+		goto vmcoreinfo_out;
 
 	goto out;
 
+vmcoreinfo_out:
+	xen_properties_destroy();
 prop_out:
 	xen_sysfs_uuid_destroy();
 uuid_out:
@@ -407,12 +447,12 @@ out:
 
 static void __exit hyper_sysfs_exit(void)
 {
+	xen_vmcoreinfo_destroy();
 	xen_properties_destroy();
 	xen_compilation_destroy();
 	xen_sysfs_uuid_destroy();
 	xen_sysfs_version_destroy();
 	xen_sysfs_type_destroy();
-
 }
 module_init(hyper_sysfs_init);
 module_exit(hyper_sysfs_exit);
-- 
1.5.6.5
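vmcoreinfo_show() above emits two hexadecimal numbers, the machine address and the maximum size, in the format "%lx %lx\n". A minimal sketch of how a user-space tool might parse such a line follows. parse_xen_vmcoreinfo() is a hypothetical helper, and the file is presumably /sys/hypervisor/vmcoreinfo, which is an assumption based on hypervisor_kobj and is not stated in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse one line in the "%lx %lx\n" format produced by vmcoreinfo_show().
 * Returns 0 on success, -1 if the line does not hold two hex numbers.
 */
static int parse_xen_vmcoreinfo(const char *line,
				unsigned long *maddr, unsigned long *max_size)
{
	assert(line != NULL && maddr != NULL && max_size != NULL);

	if (sscanf(line, "%lx %lx", maddr, max_size) != 2)
		return -1;

	return 0;
}
```

A caller would read the first line of the sysfs file into a buffer and pass it to this helper. The values in the usage below are made up for illustration.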
Daniel Kiper
2012-Dec-27  02:19 UTC
[PATCH v3 11/11] x86: Add Xen kexec control code size check to linker script
Add Xen kexec control code size check to linker script.
Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
---
 arch/x86/kernel/vmlinux.lds.S |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 22a1530..f18786a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -360,5 +360,10 @@ INIT_PER_CPU(irq_stack_union);
 
 . = ASSERT(kexec_control_code_size <= KEXEC_CONTROL_CODE_MAX_SIZE,
            "kexec control code size is too big");
-#endif
 
+#ifdef CONFIG_XEN
+. = ASSERT(xen_kexec_control_code_size - xen_relocate_kernel <=
+	KEXEC_CONTROL_CODE_MAX_SIZE,
+		"Xen kexec control code size is too big");
+#endif
+#endif
-- 
1.5.6.5
H. Peter Anvin
2012-Dec-27  03:33 UTC
[PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
Hmm... this code is being redone at the moment... this might conflict.

Daniel Kiper <daniel.kiper at oracle.com> wrote:
>Some implementations (e.g. Xen PVOPS) could not use part of identity page
>table to construct transition page table. It means that they require
>separate PUDs, PMDs and PTEs for virtual and physical (identity) mapping.
>To satisfy that requirement add extra pointer to PGD, PUD, PMD and PTE
>and align existing code.
>
>Signed-off-by: Daniel Kiper <daniel.kiper at oracle.com>
>---
> arch/x86/include/asm/kexec.h       |   10 +++++++---
> arch/x86/kernel/machine_kexec_64.c |   12 ++++++------
> 2 files changed, 13 insertions(+), 9 deletions(-)
>
>diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>index 6080d26..cedd204 100644
>--- a/arch/x86/include/asm/kexec.h
>+++ b/arch/x86/include/asm/kexec.h
>@@ -157,9 +157,13 @@ struct kimage_arch {
> };
> #else
> struct kimage_arch {
>-	pud_t *pud;
>-	pmd_t *pmd;
>-	pte_t *pte;
>+	pgd_t *pgd;
>+	pud_t *pud0;
>+	pud_t *pud1;
>+	pmd_t *pmd0;
>+	pmd_t *pmd1;
>+	pte_t *pte0;
>+	pte_t *pte1;
> };
> #endif
> 
>diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>index b3ea9db..976e54b 100644
>--- a/arch/x86/kernel/machine_kexec_64.c
>+++ b/arch/x86/kernel/machine_kexec_64.c
>@@ -137,9 +137,9 @@ out:
> 
> static void free_transition_pgtable(struct kimage *image)
> {
>-	free_page((unsigned long)image->arch.pud);
>-	free_page((unsigned long)image->arch.pmd);
>-	free_page((unsigned long)image->arch.pte);
>+	free_page((unsigned long)image->arch.pud0);
>+	free_page((unsigned long)image->arch.pmd0);
>+	free_page((unsigned long)image->arch.pte0);
> }
> 
> static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>@@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
> 		pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
> 		if (!pud)
> 			goto err;
>-		image->arch.pud = pud;
>+		image->arch.pud0 = pud;
> 		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
> 	}
> 	pud = pud_offset(pgd, vaddr);
>@@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
> 		pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
> 		if (!pmd)
> 			goto err;
>-		image->arch.pmd = pmd;
>+		image->arch.pmd0 = pmd;
> 		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
> 	}
> 	pmd = pmd_offset(pud, vaddr);
>@@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
> 		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
> 		if (!pte)
> 			goto err;
>-		image->arch.pte = pte;
>+		image->arch.pte0 = pte;
> 		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
> 	}
> 	pte = pte_offset_kernel(pmd, vaddr);

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
H. Peter Anvin
2012-Dec-27  04:02 UTC
[PATCH v3 00/11] xen: Initial kexec/kdump implementation
On 12/26/2012 06:18 PM, Daniel Kiper wrote:
> Hi,
>
> This set of patches contains initial kexec/kdump implementation for Xen v3.
> Currently only dom0 is supported, however, almost all infrustructure
> required for domU support is ready.
>
> Jan Beulich suggested to merge Xen x86 assembler code with baremetal x86 code.
> This could simplify and reduce a bit size of kernel code. However, this solution
> requires some changes in baremetal x86 code. First of all code which establishes
> transition page table should be moved back from machine_kexec_$(BITS).c to
> relocate_kernel_$(BITS).S. Another important thing which should be changed in that
> case is format of page_list array. Xen kexec hypercall requires to alternate physical
> addresses with virtual ones. These and other required stuff have not been done in that
> version because I am not sure that solution will be accepted by kexec/kdump maintainers.
> I hope that this email spark discussion about that topic.
>

I want a detailed list of the constraints that this assumes and
therefore imposes on the native implementation as a result of this. We
have had way too many patches where Xen PV hacks effectively nailgun
arbitrary, and sometimes poor, design decisions in place and now we
can't fix them.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Daniel Kiper
2012-Dec-27  23:40 UTC
[PATCH v3 00/11] xen: Initial kexec/kdump implementation
> On 12/26/2012 06:18 PM, Daniel Kiper wrote:
> > Hi,
> >
> > This set of patches contains initial kexec/kdump implementation for Xen v3.
> > Currently only dom0 is supported, however, almost all infrustructure
> > required for domU support is ready.
> >
> > Jan Beulich suggested to merge Xen x86 assembler code with baremetal x86 code.
> > This could simplify and reduce a bit size of kernel code. However, this solution
> > requires some changes in baremetal x86 code. First of all code which establishes
> > transition page table should be moved back from machine_kexec_$(BITS).c to
> > relocate_kernel_$(BITS).S. Another important thing which should be changed in that
> > case is format of page_list array. Xen kexec hypercall requires to alternate physical
> > addresses with virtual ones. These and other required stuff have not been done in that
> > version because I am not sure that solution will be accepted by kexec/kdump maintainers.
> > I hope that this email spark discussion about that topic.
>
> I want a detailed list of the constraints that this assumes and
> therefore imposes on the native implementation as a result of this. We
> have had way too many patches where Xen PV hacks effectively nailgun
> arbitrary, and sometimes poor, design decisions in place and now we
> can't fix them.

OK, but I now think that we should postpone this discussion until all
details of the generic kexec/kdump code are agreed. Sorry about that.

Daniel
Daniel Kiper
2012-Dec-28  00:53 UTC
[PATCH v3 00/11] xen: Initial kexec/kdump implementation
> Andrew Cooper <andrew.cooper3 at citrix.com> writes:
>
> > On 27/12/2012 07:53, Eric W. Biederman wrote:
> >> The syscall ABI still has the wrong semantics.
> >>
> >> Aka totally unmaintainable and umergeable.
> >>
> >> The concept of domU support is also strange. What does domU support even mean,
> >> when the dom0 support is loading a kernel to pick up Xen when Xen falls over.
> >
> > There are two requirements pulling at this patch series, but I agree
> > that we need to clarify them.
>
> It probably make sense to split them apart a little even.
>
> > When dom0 loads a crash kernel, it is loading one for Xen to use. As a
> > dom0 crash causes a Xen crash, having dom0 set up a kdump kernel for
> > itself is completely useless. This ability is present in "classic Xen
> > dom0" kernels, but the feature is currently missing in PVOPS.
>
> > Many cloud customers and service providers want the ability for a VM
> > administrator to be able to load a kdump/kexec kernel within a
> > domain[1]. This allows the VM administrator to take more proactive
> > steps to isolate the cause of a crash, the state of which is most likely
> > discarded while tearing down the domain. The result being that as far
> > as Xen is concerned, the domain is still alive, while the kdump
> > kernel/environment can work its usual magic. I am not aware of any
> > feature like this existing in the past.
>
> Which makes domU support semantically just the normal kexec/kdump
> support. Got it.

To some extent. That is true for HVM and PVonHVM guests. However,
PV guests require a slightly different kexec/kdump implementation
than plain kexec/kdump. The proposed firmware support has almost
all of the required features. The few PV-guest-specific features will
be added later (once generic firmware support, which is sufficient
at least for dom0, is agreed).

It looks like I should replace domU with PV guest in the patch description.

> The point of implementing domU is for those times when the hypervisor
> admin and the kernel admin are different.

Right.

> For domU support modifying or adding alternate versions of
> machine_kexec.c and relocate_kernel.S to add paravirtualization support
> make sense.

That is not sufficient. Please see above.

> There is the practical argument that for implementation efficiency of
> crash dumps it would be better if that support came from the hypervisor
> or the hypervisor environment. But this gets into the practical reality

I am thinking about that.

> that the hypervisor environment does not do that today. Furthermore
> kexec all by itself working in a paravirtualized environment under Xen
> makes sense.
>
> domU support is what Peter was worrying about for cleanliness, and
> we need some x86 backend ops there, and generally to be careful.

As far as I know, we do not need any additional pv_ops stuff
if we place everything needed in the kexec firmware support.

Daniel
Jan Beulich
2013-Jan-03  09:34 UTC
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
>>> On 27.12.12 at 03:18, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> Some implementations (e.g. Xen PVOPS) could not use part of identity page table
> to construct transition page table. It means that they require separate PUDs,
> PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
> requirement add extra pointer to PGD, PUD, PMD and PTE and align existing
> code.

So you keep posting this despite it having got pointed out on each
earlier submission that this is unnecessary, proven by the fact that
the non-pvops Xen kernels can get away without it. Why?

Jan
Daniel Kiper
2013-Jan-04  14:22 UTC
[Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Wed, Jan 02, 2013 at 11:26:43AM +0000, Andrew Cooper wrote:
> On 27/12/12 18:02, Eric W. Biederman wrote:
> >Andrew Cooper<andrew.cooper3 at citrix.com> writes:
> >
> >>On 27/12/2012 07:53, Eric W. Biederman wrote:
> >>>The syscall ABI still has the wrong semantics.
> >>>
> >>>Aka totally unmaintainable and umergeable.
> >>>
> >>>The concept of domU support is also strange. What does domU support even mean,
> >>>when the dom0 support is loading a kernel to pick up Xen when Xen falls over.
> >>There are two requirements pulling at this patch series, but I agree
> >>that we need to clarify them.
> >It probably make sense to split them apart a little even.
> >
> >
>
> Thinking about this split, there might be a way to simply it even more.
>
> /sbin/kexec can load the "Xen" crash kernel itself by issuing
> hypercalls using /dev/xen/privcmd. This would remove the need for
> the dom0 kernel to distinguish between loading a crash kernel for
> itself and loading a kernel for Xen.
>
> Or is this just a silly idea complicating the matter?

This is impossible with the current Xen kexec/kdump interface.
It would have to be changed to allow that. However, I suppose that
the Xen community would not be interested in such changes.

Daniel
Konrad Rzeszutek Wilk
2013-Jan-04  14:34 UTC
[Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Fri, Jan 04, 2013 at 03:22:57PM +0100, Daniel Kiper wrote:
> On Wed, Jan 02, 2013 at 11:26:43AM +0000, Andrew Cooper wrote:
> > On 27/12/12 18:02, Eric W. Biederman wrote:
> > >Andrew Cooper<andrew.cooper3 at citrix.com> writes:
> > >
> > >>On 27/12/2012 07:53, Eric W. Biederman wrote:
> > >>>The syscall ABI still has the wrong semantics.
> > >>>
> > >>>Aka totally unmaintainable and umergeable.
> > >>>
> > >>>The concept of domU support is also strange. What does domU support even mean,
> > >>>when the dom0 support is loading a kernel to pick up Xen when Xen falls over.
> > >>There are two requirements pulling at this patch series, but I agree
> > >>that we need to clarify them.
> > >It probably make sense to split them apart a little even.
> > >
> > >
> >
> > Thinking about this split, there might be a way to simply it even more.
> >
> > /sbin/kexec can load the "Xen" crash kernel itself by issuing
> > hypercalls using /dev/xen/privcmd. This would remove the need for
> > the dom0 kernel to distinguish between loading a crash kernel for
> > itself and loading a kernel for Xen.
> >
> > Or is this just a silly idea complicating the matter?
>
> This is impossible with current Xen kexec/kdump interface.
> It should be changed to do that. However, I suppose that
> Xen community would not be interested in such changes.

Why not? What is involved in it? IMHO I believe anybody would welcome a new
clean design that solves this thorny problem?
Jan Beulich
2013-Jan-04  14:41 UTC
[Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
>>> On 04.01.13 at 15:22, Daniel Kiper <daniel.kiper at oracle.com> wrote:
> On Wed, Jan 02, 2013 at 11:26:43AM +0000, Andrew Cooper wrote:
>> /sbin/kexec can load the "Xen" crash kernel itself by issuing
>> hypercalls using /dev/xen/privcmd. This would remove the need for
>> the dom0 kernel to distinguish between loading a crash kernel for
>> itself and loading a kernel for Xen.
>>
>> Or is this just a silly idea complicating the matter?
>
> This is impossible with current Xen kexec/kdump interface.

Why?

> It should be changed to do that. However, I suppose that
> Xen community would not be interested in such changes.

And again - why?

Jan
Daniel Kiper
2013-Jan-04  15:15 UTC
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On Thu, Jan 03, 2013 at 09:34:55AM +0000, Jan Beulich wrote:
> >>> On 27.12.12 at 03:18, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> > Some implementations (e.g. Xen PVOPS) could not use part of identity page table
> > to construct transition page table. It means that they require separate PUDs,
> > PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
> > requirement add extra pointer to PGD, PUD, PMD and PTE and align existing
> > code.
>
> So you keep posting this despite it having got pointed out on each
> earlier submission that this is unnecessary, proven by the fact that
> the non-pvops Xen kernels can get away without it. Why?

Sorry, I forgot to reply to your email last time.

I am still not convinced. I have tested the SUSE kernel itself and it does
not work. Maybe I missed something, but please check
arch/x86/kernel/machine_kexec_64.c:init_transition_pgtable()

I can see:

vaddr = (unsigned long)relocate_kernel;

and later:

pgd += pgd_index(vaddr);
...

This is wrong. The virtual address of relocate_kernel() in Xen differs
from its virtual address in the Linux kernel. That is why the transition
page table cannot be established in the Linux kernel, and so on.
How does this work in SUSE? I have no idea.

I am happy to fix that, but whatever the fix is,
I would like to be sure that it works.

Daniel
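The point about pgd_index(vaddr) in the message above can be illustrated with the index arithmetic alone: on x86-64 with 4-level paging, the top-level slot is bits 47..39 of the virtual address, so the same code mapped at two different virtual addresses generally needs entries in different PGD slots. A minimal user-space sketch follows. pgd_slot() and same_pgd_slot() are hypothetical names that mimic the kernel's pgd_index() arithmetic, and both example addresses are assumptions chosen for illustration, not values taken from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT	39	/* x86-64, 4-level paging */
#define PTRS_PER_PGD	512

/* Same arithmetic as pgd_index() on x86-64 with 4-level paging. */
static unsigned int pgd_slot(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/* Two mappings can share a top-level entry only if their slots match. */
static int same_pgd_slot(uint64_t a, uint64_t b)
{
	return pgd_slot(a) == pgd_slot(b);
}

/* Illustration with a kernel-text style address and a hypervisor-style one. */
static void pgd_slot_demo(void)
{
	const uint64_t linux_va = 0xffffffff81000000ULL;	/* assumed example */
	const uint64_t other_va = 0xffff82c480000000ULL;	/* assumed example */

	assert(pgd_slot(linux_va) == 511);
	assert(pgd_slot(other_va) == 261);
	assert(!same_pgd_slot(linux_va, other_va));
}
```

A page table built for one of these addresses therefore says nothing about the other, which is the mismatch the message describes.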
Jan Beulich
2013-Jan-04  16:12 UTC
[PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
>>> On 04.01.13 at 16:15, Daniel Kiper <daniel.kiper at oracle.com> wrote:
> On Thu, Jan 03, 2013 at 09:34:55AM +0000, Jan Beulich wrote:
>> >>> On 27.12.12 at 03:18, Daniel Kiper <daniel.kiper at oracle.com> wrote:
>> > Some implementations (e.g. Xen PVOPS) could not use part of identity page table
>> > to construct transition page table. It means that they require separate PUDs,
>> > PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
>> > requirement add extra pointer to PGD, PUD, PMD and PTE and align existing
>> > code.
>>
>> So you keep posting this despite it having got pointed out on each
>> earlier submission that this is unnecessary, proven by the fact that
>> the non-pvops Xen kernels can get away without it. Why?
>
> Sorry but I forgot to reply for your email last time.
>
> I am still not convinced. I have tested SUSE kernel itself and it does not work.
> Maybe I missed something but... Please check
> arch/x86/kernel/machine_kexec_64.c:init_transition_pgtable()
>
> I can see:
>
> vaddr = (unsigned long)relocate_kernel;
>
> and later:
>
> pgd += pgd_index(vaddr);
> ...

I think that mapping is simply irrelevant, as the code at
relocate_kernel gets copied to the control page and
invoked there (other than in the native case, where
relocate_kernel() gets invoked directly).

Jan

> It is wrong. relocate_kernel() virtual address in Xen is different
> than its virtual address in Linux Kernel. That is why transition
> page table could not be established in Linux Kernel and so on...
> How does this work in SUSE? I do not have an idea.
>
> I am happy to fix that but whatever fix for it is
> I would like to be sure that it works.
>
> Daniel
Daniel Kiper
2013-Jan-04  17:07 UTC
[Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Fri, Jan 04, 2013 at 02:41:17PM +0000, Jan Beulich wrote:
> >>> On 04.01.13 at 15:22, Daniel Kiper <daniel.kiper at oracle.com> wrote:
> > On Wed, Jan 02, 2013 at 11:26:43AM +0000, Andrew Cooper wrote:
> >> /sbin/kexec can load the "Xen" crash kernel itself by issuing
> >> hypercalls using /dev/xen/privcmd. This would remove the need for
> >> the dom0 kernel to distinguish between loading a crash kernel for
> >> itself and loading a kernel for Xen.
> >>
> >> Or is this just a silly idea complicating the matter?
> >
> > This is impossible with current Xen kexec/kdump interface.
>
> Why?

Because current KEXEC_CMD_kexec_load does not load kernel image and other
things into Xen memory. It means that it should live somewhere in dom0
Linux kernel memory.

Daniel
Daniel Kiper
2013-Jan-04  17:25 UTC
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On Fri, Jan 04, 2013 at 04:12:32PM +0000, Jan Beulich wrote:
> >>> On 04.01.13 at 16:15, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> > On Thu, Jan 03, 2013 at 09:34:55AM +0000, Jan Beulich wrote:
> >> >>> On 27.12.12 at 03:18, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> >> > Some implementations (e.g. Xen PVOPS) could not use part of identity page table
> >> > to construct transition page table. It means that they require separate PUDs,
> >> > PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
> >> > requirement add extra pointer to PGD, PUD, PMD and PTE and align existing
> >> > code.
> >>
> >> So you keep posting this despite it having got pointed out on each
> >> earlier submission that this is unnecessary, proven by the fact that
> >> the non-pvops Xen kernels can get away without it. Why?
> >
> > Sorry but I forgot to reply for your email last time.
> >
> > I am still not convinced. I have tested SUSE kernel itself and it does not work.
> > Maybe I missed something but... Please check
> > arch/x86/kernel/machine_kexec_64.c:init_transition_pgtable()
> >
> > I can see:
> >
> >         vaddr = (unsigned long)relocate_kernel;
> >
> > and later:
> >
> >         pgd += pgd_index(vaddr);
> >         ...
>
> I think that mapping is simply irrelevant, as the code at
> relocate_kernel gets copied to the control page and
> invoked there (other than in the native case, where
> relocate_kernel() gets invoked directly).

Right, so where is virtual mapping of control page established?
I could not find relevant code in SLES kernel which does that.

Daniel
Jan Beulich
2013-Jan-07  09:48 UTC
[PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
>>> On 04.01.13 at 18:25, Daniel Kiper <daniel.kiper at oracle.com> wrote:
> Right, so where is virtual mapping of control page established?
> I could not find relevant code in SLES kernel which does that.

In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()).
xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses
image->page_list[1].

Jan
Daniel Kiper
2013-Jan-07  12:52 UTC
[PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On Mon, Jan 07, 2013 at 09:48:20AM +0000, Jan Beulich wrote:
> >>> On 04.01.13 at 18:25, Daniel Kiper <daniel.kiper at oracle.com> wrote:
> > Right, so where is virtual mapping of control page established?
> > I could not find relevant code in SLES kernel which does that.
>
> In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()).
> xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses
> image->page_list[1].

This (xen/arch/x86/machine_kexec.c:machine_kexec_load()) maps relevant
page (allocated earlier by dom0) in hypervisor fixmap area. However,
it does not make relevant mapping in transition page table which
leads to crash when %cr3 is switched from Xen page table to
transition page table.

Daniel
Jan Beulich
2013-Jan-07  13:05 UTC
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
>>> On 07.01.13 at 13:52, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> On Mon, Jan 07, 2013 at 09:48:20AM +0000, Jan Beulich wrote:
>> >>> On 04.01.13 at 18:25, Daniel Kiper <daniel.kiper@oracle.com> wrote:
>> > Right, so where is virtual mapping of control page established?
>> > I could not find relevant code in SLES kernel which does that.
>>
>> In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()).
>> xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses
>> image->page_list[1].
>
> This (xen/arch/x86/machine_kexec.c:machine_kexec_load()) maps relevant
> page (allocated earlier by dom0) in hypervisor fixmap area. However,
> it does not make relevant mapping in transition page table which
> leads to crash when %cr3 is switched from Xen page table to
> transition page table.

That indeed could explain _random_ failures - the fixmap entries
get created with _PAGE_GLOBAL set, i.e. don't get flushed with
the CR3 write unless CR4.PGE is clear.

And I don't see how your allocation of intermediate page tables
would help: You wouldn't know where the mapping of the control
page lives until you're actually in the early relocate_kernel code.
Or was it that what distinguishes your cloned code from the
native original?

Jan
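The remark about _PAGE_GLOBAL and CR4.PGE can be illustrated with a toy model. The sketch below is not Xen or Linux code: struct toy_cpu, tlb_insert(), tlb_lookup() and toy_write_cr3() are hypothetical names, and the "TLB" is a small array. It models only one architectural rule, namely that a write to CR3 drops non-global translations while global ones survive as long as CR4.PGE is set. Real hardware may also evict entries at any time, which is one possible reading of why such failures would look random rather than deterministic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_SLOTS 8

struct tlb_entry {
    bool     valid;
    bool     global;   /* models _PAGE_GLOBAL in the mapping's PTE */
    uint64_t vpage;    /* virtual page address */
    uint64_t pframe;   /* physical frame address */
};

struct toy_cpu {
    bool             cr4_pge;  /* models CR4.PGE */
    struct tlb_entry tlb[TLB_SLOTS];
};

/* Cache a translation in the first free slot; false if the TLB is full. */
static bool tlb_insert(struct toy_cpu *cpu, uint64_t vpage,
                       uint64_t pframe, bool global)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!cpu->tlb[i].valid) {
            cpu->tlb[i] = (struct tlb_entry){ true, global, vpage, pframe };
            return true;
        }
    }
    return false;
}

/* True if a translation for vpage is still cached. */
static bool tlb_lookup(const struct toy_cpu *cpu, uint64_t vpage)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (cpu->tlb[i].valid && cpu->tlb[i].vpage == vpage)
            return true;
    }
    return false;
}

/*
 * Models a write to CR3: non-global entries are always dropped, global
 * entries are kept only while CR4.PGE is set. (Simplification: the full
 * flush that real CPUs perform when CR4.PGE itself changes is not modeled.)
 */
static void toy_write_cr3(struct toy_cpu *cpu)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!(cpu->cr4_pge && cpu->tlb[i].global))
            cpu->tlb[i].valid = false;
    }
}
```

In this model, a global entry inserted before toy_write_cr3() is still found by tlb_lookup() afterwards when cr4_pge is true, even though the new page tables contain no mapping for it; with cr4_pge false the same entry is gone.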
David Vrabel
2013-Jan-10  14:07 UTC
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 04/01/13 15:15, Daniel Kiper wrote:
> On Thu, Jan 03, 2013 at 09:34:55AM +0000, Jan Beulich wrote:
>>>>> On 27.12.12 at 03:18, Daniel Kiper <daniel.kiper@oracle.com> wrote:
>>> Some implementations (e.g. Xen PVOPS) could not use part of identity page table
>>> to construct transition page table. It means that they require separate PUDs,
>>> PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
>>> requirement add extra pointer to PGD, PUD, PMD and PTE and align existing
>>> code.
>>
>> So you keep posting this despite it having got pointed out on each
>> earlier submission that this is unnecessary, proven by the fact that
>> the non-pvops Xen kernels can get away without it. Why?
>
> Sorry but I forgot to reply for your email last time.
>
> I am still not convinced. I have tested SUSE kernel itself and it does not work.
> Maybe I missed something but... Please check
> arch/x86/kernel/machine_kexec_64.c:init_transition_pgtable()
>
> I can see:
>
>         vaddr = (unsigned long)relocate_kernel;
>
> and later:
>
>         pgd += pgd_index(vaddr);
>         ...
>
> It is wrong. relocate_kernel() virtual address in Xen is different
> than its virtual address in Linux Kernel. That is why transition
> page table could not be established in Linux Kernel and so on...
> How does this work in SUSE? I do not have an idea.

The real problem here is attempting to transition from the Xen page
tables to an identity mapping set of page tables by using some
trampoline code and page tables provided by the dom0 kernel.

This works[*] with PV because the page tables from the PV dom0 have
machine addresses and get mapped into the fixmap on kexec load, but
it's completely broken for a PVH dom0.

I shall be ditching this (bizarre) method and putting the trampoline
and transition/identity map page tables into Xen.

David

[*] Works for us in our old classic kernels, YMMV.