Ahmed Abd El Mawgood
2018-Jul-19  21:37 UTC
Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
Hi, This is my first set of patches that works as I would expect, and the third revision I sent to mailing lists. Following up with my previous discussions about kernel rootkit mitigation via placing R/O protection on critical data structure, static data, privileged registers with static content. These patches present the first part where it is only possible to place these protections on memory pages. Feature-wise, this set of patches is incomplete in the sense of: - They still don't protect privileged registers - They don't protect guest TLB from malicious gva -> gpa page mappings. But they provide sketches for a basic working design. Note that I am totally noob and it took lots of time and effort to get to this point. So sorry in advance if I overlooked something. [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot itterator functions [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE Summery: Documentation/virtual/kvm/hypercalls.txt | 14 ++++ arch/x86/include/asm/kvm_host.h | 11 ++- arch/x86/kvm/Kconfig | 7 ++ arch/x86/kvm/mmu.c | 127 ++++++++++++++++++++++--------- arch/x86/kvm/x86.c | 82 +++++++++++++++++++- include/linux/kvm_host.h | 3 + include/uapi/linux/kvm_para.h | 1 + virt/kvm/kvm_main.c | 29 ++++++- 8 files changed, 232 insertions(+), 42 deletions(-)
Ahmed Abd El Mawgood
2018-Jul-19  21:38 UTC
[PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
Following up with my previous threads on KVM assisted Anti rootkit protections. The current version doesn't address the attacks involving pages remapping. It is still design in progress, nevertheless, it will be in my later patch sets. Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666 at gmail.com> --- Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt index a890529c63ed..a9db68adb7c9 100644 --- a/Documentation/virtual/kvm/hypercalls.txt +++ b/Documentation/virtual/kvm/hypercalls.txt @@ -121,3 +121,17 @@ compute the CLOCK_REALTIME for its clock, at the same instant. Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource, or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK. + +7. KVM_HC_HMROE +---------------- +Architecture: x86 +Status: active +Purpose: Hypercall used to apply Read-Only Enforcement to guest pages +Usage: + a0: start address of page that should be protected. + +This hypercall lets a guest kernel to have part of its read/write memory +converted into read-only. This action is irreversible. KVM_HC_HMROE can +not be triggered from guest Ring 3 (user mode). The reason is that user +mode malicious software can make use of it enforce read only protection on +an arbitrary memory page thus crashing the kernel. -- 2.16.4
Ahmed Abd El Mawgood
2018-Jul-19  21:38 UTC
[PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot itterator functions
This will help sharing data into the slot_level_handler callback. In my
case I need to a share a counter for the pages traversed to use it in some
bitmap. Being able to send arbitrary memory pointer into the
slot_level_handler callback made it easy.
Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666 at gmail.com>
---
 arch/x86/kvm/mmu.c | 65 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 37 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d594690d8b95..77661530b2c4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1418,7 +1418,7 @@ static bool spte_write_protect(u64 *sptep, bool
pt_protect)
 
 static bool __rmap_write_protect(struct kvm *kvm,
 				 struct kvm_rmap_head *rmap_head,
-				 bool pt_protect)
+				 bool pt_protect, void *data)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1457,7 +1457,8 @@ static bool wrprot_ad_disabled_spte(u64 *sptep)
  *	- W bit on ad-disabled SPTEs.
  * Returns true iff any D or W bits were cleared.
  */
-static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head
*rmap_head)
+static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head
*rmap_head,
+				void *data)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1483,7 +1484,8 @@ static bool spte_set_dirty(u64 *sptep)
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+				void *data)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1515,7 +1517,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm
*kvm,
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					  PT_PAGE_TABLE_LEVEL, slot);
-		__rmap_write_protect(kvm, rmap_head, false);
+		__rmap_write_protect(kvm, rmap_head, false, NULL);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1541,7 +1543,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					  PT_PAGE_TABLE_LEVEL, slot);
-		__rmap_clear_dirty(kvm, rmap_head);
+		__rmap_clear_dirty(kvm, rmap_head, NULL);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1594,7 +1596,8 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 
 	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
 		rmap_head = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+		write_protected |= __rmap_write_protect(kvm, rmap_head, true,
+				NULL);
 	}
 
 	return write_protected;
@@ -1608,7 +1611,8 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64
gfn)
 	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
 }
 
-static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+		void *data)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1628,7 +1632,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct
kvm_rmap_head *rmap_head,
 			   struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			   unsigned long data)
 {
-	return kvm_zap_rmapp(kvm, rmap_head);
+	return kvm_zap_rmapp(kvm, rmap_head, NULL);
 }
 
 static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
@@ -5086,13 +5090,15 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
 }
 
 /* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head
*rmap_head);
+typedef bool (*slot_level_handler) (struct kvm *kvm,
+		struct kvm_rmap_head *rmap_head, void *data);
 
 /* The caller should hold mmu-lock before calling this function. */
 static __always_inline bool
 slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			slot_level_handler fn, int start_level, int end_level,
-			gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
+			gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb,
+			void *data)
 {
 	struct slot_rmap_walk_iterator iterator;
 	bool flush = false;
@@ -5100,7 +5106,7 @@ slot_handle_level_range(struct kvm *kvm, struct
kvm_memory_slot *memslot,
 	for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
 			end_gfn, &iterator) {
 		if (iterator.rmap)
-			flush |= fn(kvm, iterator.rmap);
+			flush |= fn(kvm, iterator.rmap, data);
 
 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 			if (flush && lock_flush_tlb) {
@@ -5122,36 +5128,36 @@ slot_handle_level_range(struct kvm *kvm, struct
kvm_memory_slot *memslot,
 static __always_inline bool
 slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		  slot_level_handler fn, int start_level, int end_level,
-		  bool lock_flush_tlb)
+		  bool lock_flush_tlb, void *data)
 {
 	return slot_handle_level_range(kvm, memslot, fn, start_level,
 			end_level, memslot->base_gfn,
 			memslot->base_gfn + memslot->npages - 1,
-			lock_flush_tlb);
+			lock_flush_tlb, data);
 }
 
 static __always_inline bool
 slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		      slot_level_handler fn, bool lock_flush_tlb)
+		      slot_level_handler fn, bool lock_flush_tlb, void *data)
 {
 	return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL,
-				 PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
+				 PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb, data);
 }
 
 static __always_inline bool
 slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			slot_level_handler fn, bool lock_flush_tlb)
+			slot_level_handler fn, bool lock_flush_tlb, void *data)
 {
 	return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL + 1,
-				 PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
+				 PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb, data);
 }
 
 static __always_inline bool
 slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		 slot_level_handler fn, bool lock_flush_tlb)
+		 slot_level_handler fn, bool lock_flush_tlb, void *data)
 {
 	return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL,
-				 PT_PAGE_TABLE_LEVEL, lock_flush_tlb);
+				 PT_PAGE_TABLE_LEVEL, lock_flush_tlb, data);
 }
 
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
@@ -5173,7 +5179,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
gfn_t gfn_end)
 
 			slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
 						PT_PAGE_TABLE_LEVEL, PT_MAX_HUGEPAGE_LEVEL,
-						start, end - 1, true);
+						start, end - 1, true, NULL);
 		}
 	}
 
@@ -5181,9 +5187,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
gfn_t gfn_end)
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
-				    struct kvm_rmap_head *rmap_head)
+				    struct kvm_rmap_head *rmap_head,
+				    void *data)
 {
-	return __rmap_write_protect(kvm, rmap_head, false);
+	return __rmap_write_protect(kvm, rmap_head, false, data);
 }
 
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
@@ -5193,7 +5200,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 
 	spin_lock(&kvm->mmu_lock);
 	flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
-				      false);
+				      false, NULL);
 	spin_unlock(&kvm->mmu_lock);
 
 	/*
@@ -5219,7 +5226,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 }
 
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
-					 struct kvm_rmap_head *rmap_head)
+					 struct kvm_rmap_head *rmap_head,
+					 void *data)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -5257,7 +5265,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
 	spin_lock(&kvm->mmu_lock);
 	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
-			 kvm_mmu_zap_collapsible_spte, true);
+			 kvm_mmu_zap_collapsible_spte, true, NULL);
 	spin_unlock(&kvm->mmu_lock);
 }
 
@@ -5267,7 +5275,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 	bool flush;
 
 	spin_lock(&kvm->mmu_lock);
-	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
+	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false, NULL);
 	spin_unlock(&kvm->mmu_lock);
 
 	lockdep_assert_held(&kvm->slots_lock);
@@ -5290,7 +5298,7 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm
*kvm,
 
 	spin_lock(&kvm->mmu_lock);
 	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
-					false);
+					false, NULL);
 	spin_unlock(&kvm->mmu_lock);
 
 	/* see kvm_mmu_slot_remove_write_access */
@@ -5307,7 +5315,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 	bool flush;
 
 	spin_lock(&kvm->mmu_lock);
-	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
+	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false,
+			NULL);
 	spin_unlock(&kvm->mmu_lock);
 
 	lockdep_assert_held(&kvm->slots_lock);
-- 
2.16.4
Ahmed Abd El Mawgood
2018-Jul-19  21:38 UTC
[PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
This patch introduces a hypercall implemented for X86 that can assist
against subset of kernel rootkits, it works by place readonly protection in
shadow PTE. The end result protection is also kept in a bitmap for each
kvm_memory_slot and is used as reference when updating SPTEs. The whole
goal is to protect the guest kernel static data from modification if
attacker is running from guest ring 0, for this reason there is no
hypercall to revert effect of Memory ROE hypercall. This patch doesn't
implement integrity check on guest TLB so obvious attack on the current
implementation will involve guest virtual address -> guest physical
address remapping, but there are plans to fix that.
Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666 at gmail.com>
---
 arch/x86/include/asm/kvm_host.h | 11 +++++-
 arch/x86/kvm/Kconfig            |  7 ++++
 arch/x86/kvm/mmu.c              | 72 ++++++++++++++++++++++++++++++------
 arch/x86/kvm/x86.c              | 82 +++++++++++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h        |  3 ++
 include/uapi/linux/kvm_para.h   |  1 +
 virt/kvm/kvm_main.c             | 29 +++++++++++++--
 7 files changed, 186 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c13cd28d9d1b..128bcfa246a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -236,6 +236,15 @@ struct kvm_mmu_memory_cache {
 	void *objects[KVM_NR_MEM_OBJS];
 };
 
+/*
+ * This is internal structure used to be be able to access kvm memory slot and
+ * have track of the number of current PTE when doing shadow PTE walk
+ */
+struct kvm_write_access_data {
+	int i;
+	struct kvm_memory_slot *memslot;
+};
+
 /*
  * the pages used as guest page table on soft mmu are tracked by
  * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used
@@ -1130,7 +1139,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64
accessed_mask,
 		u64 acc_track_mask, u64 me_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+void kvm_mmu_slot_apply_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 92fd433c50b9..8ae822a8dc7a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,13 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+config KVM_MROE
+	bool "Hypercall Memory Read-Only Enforcement"
+	depends on KVM && X86
+	help
+	This option add KVM_HC_HMROE hypercall to kvm which as hardening
+	mechanism to protect memory pages from being edited.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly
under
 # the virtualization menu.
 source drivers/vhost/Kconfig
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 77661530b2c4..4ce6a9a19a23 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1416,9 +1416,8 @@ static bool spte_write_protect(u64 *sptep, bool
pt_protect)
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm,
-				 struct kvm_rmap_head *rmap_head,
-				 bool pt_protect, void *data)
+static bool __rmap_write_protection(struct kvm *kvm,
+		struct kvm_rmap_head *rmap_head, bool pt_protect)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1430,6 +1429,38 @@ static bool __rmap_write_protect(struct kvm *kvm,
 	return flush;
 }
 
+#ifdef CONFIG_KVM_MROE
+static bool __rmap_write_protect_mroe(struct kvm *kvm,
+		struct kvm_rmap_head *rmap_head,
+		bool pt_protect,
+		struct kvm_write_access_data *d)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool prot;
+	bool flush = false;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+		prot = !test_bit(d->i, d->memslot->mroe_bitmap) &&
pt_protect;
+		flush |= spte_write_protect(sptep, prot);
+		d->i++;
+	}
+	return flush;
+}
+#endif
+
+static bool __rmap_write_protect(struct kvm *kvm,
+		struct kvm_rmap_head *rmap_head,
+		bool pt_protect,
+		struct kvm_write_access_data *d)
+{
+#ifdef CONFIG_KVM_MROE
+	if (d != NULL)
+		return __rmap_write_protect_mroe(kvm, rmap_head, pt_protect, d);
+#endif
+	return __rmap_write_protection(kvm, rmap_head, pt_protect);
+}
+
 static bool spte_clear_dirty(u64 *sptep)
 {
 	u64 spte = *sptep;
@@ -1517,7 +1548,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm
*kvm,
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					  PT_PAGE_TABLE_LEVEL, slot);
-		__rmap_write_protect(kvm, rmap_head, false, NULL);
+		__rmap_write_protection(kvm, rmap_head, false);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1593,11 +1624,15 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	struct kvm_rmap_head *rmap_head;
 	int i;
 	bool write_protected = false;
+	struct kvm_write_access_data data = {
+		.i = 0,
+		.memslot = slot,
+	};
 
 	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
 		rmap_head = __gfn_to_rmap(gfn, i, slot);
 		write_protected |= __rmap_write_protect(kvm, rmap_head, true,
-				NULL);
+				&data);
 	}
 
 	return write_protected;
@@ -5190,21 +5225,36 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
 				    struct kvm_rmap_head *rmap_head,
 				    void *data)
 {
-	return __rmap_write_protect(kvm, rmap_head, false, data);
+	return __rmap_write_protect(kvm, rmap_head, false,
+			(struct kvm_write_access_data *)data);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+static bool slot_rmap_apply_protection(struct kvm *kvm,
+		struct kvm_rmap_head *rmap_head,
+		void *data)
+{
+	struct kvm_write_access_data *d = (struct kvm_write_access_data *) data;
+	bool prot_mask = !(d->memslot->flags & KVM_MEM_READONLY);
+
+	return __rmap_write_protect(kvm, rmap_head, prot_mask, d);
+}
+
+void kvm_mmu_slot_apply_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot)
 {
 	bool flush;
+	struct kvm_write_access_data data = {
+		.i = 0,
+		.memslot = memslot,
+	};
 
 	spin_lock(&kvm->mmu_lock);
-	flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
-				      false, NULL);
+	flush = slot_handle_all_level(kvm, memslot, slot_rmap_apply_protection,
+				      false, &data);
 	spin_unlock(&kvm->mmu_lock);
 
 	/*
-	 * kvm_mmu_slot_remove_write_access() and kvm_vm_ioctl_get_dirty_log()
+	 * kvm_mmu_slot_apply_write_access() and kvm_vm_ioctl_get_dirty_log()
 	 * which do tlb flush out of mmu-lock should be serialized by
 	 * kvm->slots_lock otherwise tlb flush would be missed.
 	 */
@@ -5301,7 +5351,7 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm
*kvm,
 					false, NULL);
 	spin_unlock(&kvm->mmu_lock);
 
-	/* see kvm_mmu_slot_remove_write_access */
+	/* see kvm_mmu_slot_apply_write_access*/
 	lockdep_assert_held(&kvm->slots_lock);
 
 	if (flush)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0046aa70205a..9addc46d75be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4177,7 +4177,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct
kvm_dirty_log *log)
 
 	/*
 	 * All the TLBs can be flushed out of mmu lock, see the comments in
-	 * kvm_mmu_slot_remove_write_access().
+	 * kvm_mmu_slot_apply_write_access().
 	 */
 	lockdep_assert_held(&kvm->slots_lock);
 	if (is_dirty)
@@ -6670,7 +6670,76 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu,
gpa_t paddr,
 }
 #endif
 
-/*
+#ifdef CONFIG_KVM_MROE
+static int __roe_protect_frame(struct kvm *kvm, gpa_t gpa)
+{
+	struct kvm_memory_slot *slot;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot || gfn > slot->base_gfn + slot->npages)
+		return -EINVAL;
+	set_bit(gfn - slot->base_gfn, slot->mroe_bitmap);
+	kvm_mmu_slot_apply_write_access(kvm, slot);
+	kvm_arch_flush_shadow_memslot(kvm, slot);
+
+	return 0;
+}
+
+static int roe_protect_frame(struct kvm *kvm, gpa_t gpa)
+{
+	int r;
+
+	mutex_lock(&kvm->slots_lock);
+	r = __roe_protect_frame(kvm, gpa);
+	mutex_unlock(&kvm->slots_lock);
+	return r;
+}
+
+static bool kvm_mroe_userspace(struct kvm_vcpu *vcpu)
+{
+	u64 rflags;
+	u64 cr0 = kvm_read_cr0(vcpu);
+	u64 iopl;
+
+	// first checking we are not in protected mode
+	if ((cr0 & 1) == 0)
+		return false;
+	/*
+	 * we don't need to worry about comments in __get_regs
+	 * because we are sure that this function will only be
+	 * triggered at the end of a hypercall
+	 */
+	 rflags = kvm_get_rflags(vcpu);
+	iopl = (rflags >> 12) & 3;
+	if (iopl != 3)
+		return false;
+	return true;
+}
+
+static int kvm_mroe(struct kvm_vcpu *vcpu, u64 gva)
+{
+	struct kvm *kvm = vcpu->kvm;
+	gpa_t gpa;
+	u64 hva;
+
+	/*
+	 * First we need to maek sure that we are running from something that
+	 * isn't usermode
+	 */
+	if (kvm_mroe_userspace(vcpu))
+		return -1;//I don't really know what to return
+	if (gva & ~PAGE_MASK)
+		return -EINVAL;
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	hva = gfn_to_hva(kvm, gpa >> PAGE_SHIFT);
+	if (!access_ok(VERIFY_WRITE, hva, PAGE_SIZE))
+		return -EINVAL;
+	return roe_protect_frame(vcpu->kvm, gpa);
+}
+#endif
+
+ /*
  * kvm_pv_kick_cpu_op:  Kick a vcpu.
  *
  * @apicid - apicid of vcpu to be kicked.
@@ -6737,6 +6806,11 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_CLOCK_PAIRING:
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
+#endif
+#ifdef CONFIG_KVM_MROE
+	case KVM_HC_HMROE:
+		ret = kvm_mroe(vcpu, a0);
+		break;
 #endif
 	default:
 		ret = -KVM_ENOSYS;
@@ -8971,8 +9045,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 				     struct kvm_memory_slot *new)
 {
 	/* Still write protect RO slot */
+	kvm_mmu_slot_apply_write_access(kvm, new);
 	if (new->flags & KVM_MEM_READONLY) {
-		kvm_mmu_slot_remove_write_access(kvm, new);
 		return;
 	}
 
@@ -9010,7 +9084,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		if (kvm_x86_ops->slot_enable_log_dirty)
 			kvm_x86_ops->slot_enable_log_dirty(kvm, new);
 		else
-			kvm_mmu_slot_remove_write_access(kvm, new);
+			kvm_mmu_slot_apply_write_access(kvm, new);
 	} else {
 		if (kvm_x86_ops->slot_disable_log_dirty)
 			kvm_x86_ops->slot_disable_log_dirty(kvm, new);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..82c5780e11d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -297,6 +297,9 @@ static inline int kvm_vcpu_exiting_guest_mode(struct
kvm_vcpu *vcpu)
 struct kvm_memory_slot {
 	gfn_t base_gfn;
 	unsigned long npages;
+#ifdef CONFIG_KVM_MROE
+	unsigned long *mroe_bitmap;
+#endif
 	unsigned long *dirty_bitmap;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index dcf629dd2889..4e2badc09b5b 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -26,6 +26,7 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
+#define KVM_HC_HMROE			10
 
 /*
  * hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b47507faab5..0f7141e4d550 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -794,6 +794,17 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot
*memslot)
 	return 0;
 }
 
+static int kvm_init_mroe_bitmap(struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_KVM_MROE
+	slot->mroe_bitmap = kvzalloc(BITS_TO_LONGS(slot->npages) *
+	sizeof(unsigned long), GFP_KERNEL);
+	if (!slot->mroe_bitmap)
+		return -ENOMEM;
+#endif
+	return 0;
+}
+
 /*
  * Insert memslot and re-sort memslots based on their GFN,
  * so binary search could be used to lookup GFN.
@@ -1011,6 +1022,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if (kvm_create_dirty_bitmap(&new) < 0)
 			goto out_free;
 	}
+	if (kvm_init_mroe_bitmap(&new) < 0)
+		goto out_free;
 
 	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!slots)
@@ -1264,13 +1277,23 @@ static bool memslot_is_readonly(struct kvm_memory_slot
*slot)
 	return slot->flags & KVM_MEM_READONLY;
 }
 
+static bool gfn_is_readonly(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+#ifdef CONFIG_KVM_MROE
+	return test_bit(gfn - slot->base_gfn, slot->mroe_bitmap) ||
+		memslot_is_readonly(slot);
+#else
+	return memslot_is_readonly(slot);
+#endif
+}
+
 static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
 				       gfn_t *nr_pages, bool write)
 {
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return KVM_HVA_ERR_BAD;
 
-	if (memslot_is_readonly(slot) && write)
+	if (gfn_is_readonly(slot, gfn) && write)
 		return KVM_HVA_ERR_RO_BAD;
 
 	if (nr_pages)
@@ -1314,7 +1337,7 @@ unsigned long gfn_to_hva_memslot_prot(struct
kvm_memory_slot *slot,
 	unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);
 
 	if (!kvm_is_error_hva(hva) && writable)
-		*writable = !memslot_is_readonly(slot);
+		*writable = !gfn_is_readonly(slot, gfn);
 
 	return hva;
 }
@@ -1554,7 +1577,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot
*slot, gfn_t gfn,
 	}
 
 	/* Do not map writable pfn in the readonly memslot. */
-	if (writable && memslot_is_readonly(slot)) {
+	if (writable && gfn_is_readonly(slot, gfn)) {
 		*writable = false;
 		writable = NULL;
 	}
-- 
2.16.4
Ahmed Soliman
2018-Jul-20  00:26 UTC
[PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
On 20 July 2018 at 00:59, Jann Horn <jannh at google.com> wrote:> On Thu, Jul 19, 2018 at 11:40 PM Ahmed Abd El Mawgood> Why are you implementing this in the kernel, instead of doing it in > host userspace?I thought about implementing it completely in QEMU but It won't be possible for few reasons: - After talking to QEMU folks I came up to conclusion that it when it comes to managing memory allocated for guest, it is always better to let KVM handles everything, unless there is a good reason to play with that memory chunk inside QEMU itself. - But actually there is a good reason for implementing ROE in kernel space, it is that ROE is architecture dependent to great extent. I should have emphasized that the only currently supported architecture is X86. I am not sure how deep the dependency on architecture goes. But as for now the current set of patches does a SPTE enumeration as part of the process. To my best knowledge, this isn't exposed outside arch/x68/kvm let alone having a host user space interface for it. Also the way I am planning to protect TLB from malicious gva -> gpa mapping is by knowing that in x86 it is possible to VMEXIT on page faults, I am not sure if it will safe to assume that all kvm supported architectures will behave this way. For these reasons I thought it will be better if arch dependent stuff (the mechanism implementation) is kept in arch/*/kvm folder and with minimal modifications to virt/kvm/* after setting a kconfig variable to enable ROE. But I left room for the user space app using kvm to decide the rightful policy for handling ROE violations. The way it works by KVM_EXIT_MMIO error to user space, keeping all the architectural details hidden away from user space. A last note is that I didn't create this from scratch, instead I extended KVM_MEM_READONLY implementation to also allow R/O per page instead R/O per whole slot which is already done in kernel space.
Randy Dunlap
2018-Jul-20  01:07 UTC
[PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
On 07/19/2018 02:38 PM, Ahmed Abd El Mawgood wrote:> This patch introduces a hypercall implemented for X86 that can assist > against subset of kernel rootkits, it works by place readonly protection in > shadow PTE. The end result protection is also kept in a bitmap for each > kvm_memory_slot and is used as reference when updating SPTEs. The whole > goal is to protect the guest kernel static data from modification if > attacker is running from guest ring 0, for this reason there is no > hypercall to revert effect of Memory ROE hypercall. This patch doesn't > implement integrity check on guest TLB so obvious attack on the current > implementation will involve guest virtual address -> guest physical > address remapping, but there are plans to fix that. > > Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666 at gmail.com> > ---> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 92fd433c50b9..8ae822a8dc7a 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT > This option adds a R/W kVM module parameter 'mmu_audit', which allows > auditing of KVM MMU events at runtime. > > +config KVM_MROE > + bool "Hypercall Memory Read-Only Enforcement" > + depends on KVM && X86 > + help > + This option add KVM_HC_HMROE hypercall to kvm which as hardeningadds to kvm as a hardening (???)> + mechanism to protect memory pages from being edited. > + > # OK, it's a little counter-intuitive to do this, but it puts it neatly under > # the virtualization menu. > source drivers/vhost/Kconfig-- ~Randy
Randy Dunlap
2018-Jul-20  01:11 UTC
[PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
On 07/19/2018 02:38 PM, Ahmed Abd El Mawgood wrote:> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt > index a890529c63ed..a9db68adb7c9 100644 > --- a/Documentation/virtual/kvm/hypercalls.txt > +++ b/Documentation/virtual/kvm/hypercalls.txt > @@ -121,3 +121,17 @@ compute the CLOCK_REALTIME for its clock, at the same instant. > > Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource, > or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK. > + > +7. KVM_HC_HMROE > +---------------- > +Architecture: x86 > +Status: active > +Purpose: Hypercall used to apply Read-Only Enforcement to guest pages > +Usage: > + a0: start address of page that should be protected.Is this done one page per call? No grouping, no multiple pages?> + > +This hypercall lets a guest kernel to have part of its read/write memorylets a guest kernel have part of> +converted into read-only. This action is irreversible. KVM_HC_HMROE can > +not be triggered from guest Ring 3 (user mode). The reason is that user > +mode malicious software can make use of it enforce read only protection onmake use of it to enforce> +an arbitrary memory page thus crashing the kernel. >-- ~Randy
Konrad Rzeszutek Wilk
2018-Jul-20  02:45 UTC
Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
On Thu, Jul 19, 2018 at 11:37:59PM +0200, Ahmed Abd El Mawgood wrote:> Hi, > > This is my first set of patches that works as I would expect, and the > third revision I sent to mailing lists. > > Following up with my previous discussions about kernel rootkit mitigation > via placing R/O protection on critical data structure, static data, > privileged registers with static content. These patches present the > first part where it is only possible to place these protections on > memory pages. Feature-wise, this set of patches is incomplete in the sense of: > - They still don't protect privileged registers > - They don't protect guest TLB from malicious gva -> gpa page mappings. > But they provide sketches for a basic working design. Note that I am totally > noob and it took lots of time and effort to get to this point. So sorry in > advance if I overlooked something.This reminds me of Xen PV page model. That is the hypervisor is the one auditing the page tables and the guest's pages are read-only. Ditto for IDT, GDT, etc. Gosh, did you by chance look at how Xen PV mechanism is done? It may provide the protection you are looking for? CC-ing xen-devel.> > [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation > [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot itterator functions > [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE > > Summery: > > Documentation/virtual/kvm/hypercalls.txt | 14 ++++ > arch/x86/include/asm/kvm_host.h | 11 ++- > arch/x86/kvm/Kconfig | 7 ++ > arch/x86/kvm/mmu.c | 127 ++++++++++++++++++++++--------- > arch/x86/kvm/x86.c | 82 +++++++++++++++++++- > include/linux/kvm_host.h | 3 + > include/uapi/linux/kvm_para.h | 1 + > virt/kvm/kvm_main.c | 29 ++++++- > 8 files changed, 232 insertions(+), 42 deletions(-) >
Ahmed Soliman
2018-Jul-20  14:44 UTC
[PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
On 20 July 2018 at 03:28, Jann Horn <jannh at google.com> wrote:> On Fri, Jul 20, 2018 at 2:26 AM Ahmed Soliman > <ahmedsoliman0x666 at gmail.com> wrote: >> >> On 20 July 2018 at 00:59, Jann Horn <jannh at google.com> wrote: >> > On Thu, Jul 19, 2018 at 11:40 PM Ahmed Abd El Mawgood >> >> > Why are you implementing this in the kernel, instead of doing it in >> > host userspace? >> >> I thought about implementing it completely in QEMU but It won't be >> possible for few reasons: >> >> - After talking to QEMU folks I came up to conclusion that it when it >> comes to managing memory allocated for guest, it is always better to let >> KVM handles everything, unless there is a good reason to play with that >> memory chunk inside QEMU itself. > > Why? It seems to me like it'd be easier to add a way to mprotect() > guest pages to readonly via virtio or whatever in QEMU than to add > kernel code?I did an early prototype with mprotect(), But then mprotect() didn't do exactly what I wanted, The goal here is to prevent the guest from writing to protected page but allow the host to do if it ever needs to at the same time. mprotect() will either allow both host and guest, or prevent both host and guest. Even though I can not come up with a use case where one might need to allow host to read/write to a page but prevent guest from writing to that page, I think that it is a limitation that will cost complete redesign if it proves that this kind of behavior is undesired. Also mprotect is kind of inflexible. Writing to mprotected pages would immediately trigger SIGSEGV and then userspace process will have to handle that fault in order to control the situation. That sounded to me more like a little hack than a solid design.> And if you ever want to support VM snapshotting/resumption, you'll > need support for restoring the protection flags from QEMU anyway.I never thought about that, but thanks for letting me know. I will keep that in my TODO list.>> - But actually there is a good reason for implementing ROE in kernel space, >> it is that ROE is architecture dependent to great extent. > > How so? The host component just has to make pages in guest memory > readonly, right? As far as I can tell, from QEMU, it'd more or less be > a matter of calling mprotect() a few times? (Plus potentially some > hooks to prevent other virtio code from crashing by attempting to > access protected pages - but you'd need that anyway, no matter where > the protection for the guest is enforced.)I don't think that virtio would crash that way, because host should be able write to memory as it wants. but yet I see where there is this going, probably I can add hooks so that virtio can respect the read only flags.>> I should have >> emphasized that the only currently supported architecture is X86. I am >> not sure how deep the dependency on architecture goes. But as for now >> the current set of patches does a SPTE enumeration as part of the process. >> To my best knowledge, this isn't exposed outside arch/x68/kvm let alone >> having a host user space interface for it. Also the way I am planning to >> protect TLB from malicious gva -> gpa mapping is by knowing that in x86 >> it is possible to VMEXIT on page faults, I am not sure if it will safe to >> assume that all kvm supported architectures will behave this way. > > You mean EPT faults, right? If so: I think all architectures have to > support that - there are already other reasons why random guest memory > accesses can fault. In particular, the host can page out guest memory. > I think that's the case on all architectures?Here my lack of full knowledge kicks in, I am not sure whether is EPT fault or guest pf is what I want to capture validate. I think X86 can vm exit on both. Due to nature of ROE, guest user space code can not have ROE because it is irreversible, so it will be safe to assume that only pages that are not swappable are the one's I would care about. still lots of the details are blurry for me. But what I was trying to say is that there is always differences based on architecture that is why it will be better to do things in kernel module if we decided not to use mprotect method.>> For these reasons I thought it will be better if arch dependent stuff (the >> mechanism implementation) is kept in arch/*/kvm folder and with minimal >> modifications to virt/kvm/* after setting a kconfig variable to enable ROE. >> But I left room for the user space app using kvm to decide the rightful policy >> for handling ROE violations. The way it works by KVM_EXIT_MMIO error to user >> space, keeping all the architectural details hidden away from user space. >> >> A last note is that I didn't create this from scratch, instead I extended >> KVM_MEM_READONLY implementation to also allow R/O per page instead >> R/O per whole slot which is already done in kernel space. > > But then you still have to also do something about virtio code in QEMU > that might write to those pages, right?Probably yes, still I haven't fully planned that yet. But I was thinking about if I can make use of IOMMU protection for DMA and have something similar for emulated devices backed by virtio.
Seemingly Similar Threads
- [PATCH RFC V4 3/3] KVM: X86: Adding skeleton for Memory ROE
- Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM V4
- Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
- Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
- [PATCH v9 34/84] KVM: x86: page_track: add support for preread, prewrite and preexec