Current Xen supports 2MB super pages for NPT/EPT. The attached patches extend this feature to support 1GB pages. The PoD (populate-on-demand) code introduced by George Dunlap made P2M modification harder; I tried to preserve the existing PoD design by introducing a 1GB PoD cache list.

Note that 1GB PoD can be dropped if we don't care about 1GB pages when PoD is enabled. In that case, we can simply split a 1GB PDPE into 512 2MB PDE entries and grab pages from the existing PoD super-page list. That would make 1gb_p2m_pod.patch pretty much go away.

Any comments or suggestions on the design will be appreciated.

Thanks,

-Wei

The following is the description:

=== 1gb_tools.patch ===

Extends the existing setup_guest() function. Basically, it tries to allocate 1GB pages whenever available. If this request fails, it falls back to 2MB. If both fail, 4KB pages are used.

=== 1gb_p2m.patch ===

* p2m_next_level()
Checks the PSE bit of the L3 page table entry. If a 1GB mapping is found (PSE=1), we split it into 512 2MB pages.

* p2m_set_entry()
Sets the PSE bit of the L3 P2M entry if the page order is 18 (1GB).

* p2m_gfn_to_mfn()
Adds support for the 1GB case when doing gfn-to-mfn translation. When the L3 entry is marked as POPULATE_ON_DEMAND, we call p2m_pod_demand_populate(). Otherwise, we do the regular address translation (gfn ==> mfn).

* p2m_gfn_to_mfn_current()
Similar to p2m_gfn_to_mfn(). When the L3 entry is marked as POPULATE_ON_DEMAND, it demands a populate using p2m_pod_demand_populate(). Otherwise, it does a normal translation, taking 1GB pages into consideration.

* set_p2m_entry()
Requests 1GB pages.

* audit_p2m()
Supports 1GB pages while auditing the p2m table.

* p2m_change_type_global()
Deals with 1GB pages when changing the global page type.

=== 1gb_p2m_pod.patch ===

* xen/include/asm-x86/p2m.h
Minor changes to deal with PoD. The super-page cache list is separated into 2MB and 1GB lists. Similarly, we record the last gpfn of sweeping for both 2MB and 1GB.

* p2m_pod_cache_add()
Checks the page order and adds 1GB super pages to the PoD 1GB cache list.

* p2m_pod_cache_get()
Grabs a page from the cache list. It tries to break a 1GB page into 512 2MB pages if the 2MB PoD list is empty. Similarly, 4KB pages can be taken from super pages. The breaking order is 2MB first, then 1GB.

* p2m_pod_cache_target()
Sets the PoD cache size. To increase the PoD target, we try to allocate 1GB pages from the Xen domheap. If this fails, we try 2MB. If both fail, we try 4KB, which is guaranteed to work. To decrease the target, we use a similar approach: we first try to free pages from the 1GB PoD cache list, then from the 2MB list, and finally from the 4KB list.

* p2m_pod_zero_check_superpage_1gb()
Adds a new function to zero-check a 1GB page, similar to p2m_pod_zero_check_superpage_2mb().

* 1GB PoD sweep
Adds a new function to sweep 1GB pages from guest memory, analogous to the existing 2MB sweep.

* p2m_pod_demand_populate()
The trick of this function is to remap and retry if p2m_pod_cache_get() fails. When that happens, this function splits the p2m table entry into smaller ones (e.g. 1GB ==> 2MB or 2MB ==> 4KB). That guarantees populate demands always succeed.
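Below is a minimal sketch of the allocation fallback described for 1gb_tools.patch. The helper try_populate() is a hypothetical stand-in for the real libxc populate call used by setup_guest(); only the 1GB (order 18) / 2MB (order 9) / 4KB (order 0) fallback chain is meant to be accurate.

    /* Hypothetical sketch of the setup_guest() fallback path: try 1GB pages
     * first, then 2MB, then plain 4KB.  try_populate() stands in for the
     * real populate-physmap call and returns 0 on success. */
    #define PAGE_ORDER_4K   0
    #define PAGE_ORDER_2M   9
    #define PAGE_ORDER_1G  18

    static int try_populate(unsigned long pfn, unsigned int order); /* hypothetical */

    static int populate_guest_memory(unsigned long pfn, unsigned long nr_pages)
    {
        while ( nr_pages != 0 )
        {
            unsigned int order;

            if ( (nr_pages >= (1UL << PAGE_ORDER_1G)) &&
                 !(pfn & ((1UL << PAGE_ORDER_1G) - 1)) &&
                 (try_populate(pfn, PAGE_ORDER_1G) == 0) )
                order = PAGE_ORDER_1G;               /* got a 1GB page */
            else if ( (nr_pages >= (1UL << PAGE_ORDER_2M)) &&
                      !(pfn & ((1UL << PAGE_ORDER_2M) - 1)) &&
                      (try_populate(pfn, PAGE_ORDER_2M) == 0) )
                order = PAGE_ORDER_2M;               /* fall back to 2MB */
            else if ( try_populate(pfn, PAGE_ORDER_4K) == 0 )
                order = PAGE_ORDER_4K;               /* last resort: 4KB */
            else
                return -1;                           /* out of memory */

            pfn      += 1UL << order;
            nr_pages -= 1UL << order;
        }

        return 0;
    }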
George Dunlap
2009-Mar-18 17:20 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Thanks for doing this work, Wei -- especially all the extra effort for the PoD integration.

One question: how well would you say you've tested the PoD functionality? Or, to put it the other way, how much do I need to prioritize testing this before the 3.4 release?

It wouldn't be a bad idea to do as you suggested and break things into 2MB pages for the PoD case. To take the best advantage of this in a PoD scenario, you'd need a balloon driver that could allocate 1GB of contiguous *guest* p2m space, which seems a bit optimistic at this point...

 -George
Keir Fraser
2009-Mar-18 17:32 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
I'm not sure about putting this in for 3.4 unless there's a significant performance win.

 -- Keir
Huang2, Wei
2009-Mar-18 17:37 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
George,

Thanks for your comments. I tested the first two parts (tools and 1gb_p2m); they are relatively straightforward in my opinion. As for the third one (PoD), I have just started testing, so it still needs quite a lot of testing effort.

If you feel the intermediate approach (tools + 1gb_p2m) is more appealing, I will submit another patch today.

-Wei
Huang2, Wei
2009-Mar-18 17:45 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Keir,

Would you consider the middle approach (tools + normal p2m code) for 3.4? I understand that 1GB PoD is too big. But the middle one is much simpler.

Thanks,

-Wei
Keir Fraser
2009-Mar-18 19:15 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Yes, I might.

 -- Keir

On 18/03/2009 17:45, "Huang2, Wei" <Wei.Huang2@amd.com> wrote:

> Would you consider the middle approach (tools + normal p2m code) for
> 3.4? I understand that 1GB PoD is too big. But the middle one is much
> simpler.
Huang2, Wei
2009-Mar-19 08:07 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Here are patches using the middle approach. It handles 1GB pages in PoD by remapping the 1GB region with 2MB pages and retrying. I also added code for 1GB detection. Please comment.

Thanks a lot,

-Wei
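For illustration, here is a rough sketch of the remap-and-retry handling described above: when a demand-populate hits a 1GB PoD entry and no 1GB page is available, the entry is rewritten as 512 2MB PoD entries and the access is retried. The helpers pod_cache_get(), pod_map_entry() and pod_set_pod_entry() are hypothetical stand-ins, not the functions in the actual patch.

    /* Sketch only: handle a PoD fault on a 1GB entry.  If no 1GB page can
     * be obtained, remap the region as 512 2MB PoD entries and ask the
     * caller to retry, so the next fault is satisfied with 2MB instead. */
    #define POD_ORDER_2M   9
    #define POD_ORDER_1G  18

    struct p2m_domain;  /* opaque here */

    /* Hypothetical helpers standing in for the real PoD/p2m primitives. */
    static int  pod_cache_get(struct p2m_domain *p2m, unsigned int order);
    static int  pod_map_entry(struct p2m_domain *p2m, unsigned long gfn,
                              unsigned int order);
    static void pod_set_pod_entry(struct p2m_domain *p2m, unsigned long gfn,
                                  unsigned int order);

    static int pod_demand_populate_1gb(struct p2m_domain *p2m, unsigned long gfn)
    {
        unsigned long base = gfn & ~((1UL << POD_ORDER_1G) - 1);
        unsigned long i;

        if ( pod_cache_get(p2m, POD_ORDER_1G) == 0 )
            return pod_map_entry(p2m, base, POD_ORDER_1G); /* got a 1GB page */

        /* No 1GB page: splinter the entry into 512 2MB PoD entries. */
        for ( i = 0; i < (1UL << (POD_ORDER_1G - POD_ORDER_2M)); i++ )
            pod_set_pod_entry(p2m, base + (i << POD_ORDER_2M), POD_ORDER_2M);

        return 1; /* "remapped, please retry": the faulting access is replayed */
    }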
Dan Magenheimer
2009-Mar-19 14:17 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
I'd like to reiterate my argument raised in a previous discussion of hugepages: just because this CAN be made to work doesn't imply that it SHOULD be made to work. Real users use larger pages in their OS for the sole reason that they expect a performance improvement. If it magically works, but works slowly (and possibly more slowly than if the OS had just used small pages to start with), this is likely to lead to unsatisfied customers, and perhaps allegations such as "Xen sucks when running databases".

So, please, let's think this through before implementing it just because we can. At a minimum, an administrator should somehow be warned if large pages are getting splintered.

And if it's going in over my objection, please tie it to a boot option that defaults off, so administrator action is required to allow silent splintering.

My two cents...
Dan
Dan,

Thanks for your comments. I am not sure which splintering overhead you are referring to. I can think of three areas:

1. Splintering in page allocation
Xen fails to allocate the requested page order, so it falls back to smaller pages to set up the p2m table. The overhead is O(guest_mem_size) and is a one-time cost.

2. P2M splitting of large pages into smaller pages
This is one-directional because we don't merge smaller pages back into large ones. The worst case is splitting all guest large pages, so the overhead is O(total_large_page_mem). In the long run this overhead converges to 0 because the splitting is one-directional. Note this also covers the case where the PoD feature is enabled.

3. CPU splintering
If the CPU does not support 1GB pages, it automatically splinters them into smaller ones (such as 2MB). In this case the overhead is always there, but 1) it only happens on a small number of older chips, and 2) I believe it is still faster than 4KB pages. CPUID (the 1GB page feature flag and 1GB TLB entry information) can be used to detect and avoid this if we really don't like it (see the sketch below).

I agree with your concerns. Customers should have the right to make their own decision, but that requires the new feature to be enabled in the first place. For a lot of benchmarks, the splintering overhead can be offset by the benefits of huge pages; SPECjbb is a good example of using large pages (see Ben Serebrin's presentation at Xen Summit). With that said, I agree with the idea of adding a new option to the guest configuration file.

-Wei
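For reference, a minimal sketch of the CPUID check mentioned in point 3: on x86-64, 1GB page support is reported in CPUID leaf 0x80000001, EDX bit 26 (the Page1GB/PDPE1GB flag). The helper names here are illustrative only.

    #include <stdint.h>

    /* Raw CPUID wrapper (GCC inline asm). */
    static inline void cpuid_raw(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                                 uint32_t *ecx, uint32_t *edx)
    {
        asm volatile ( "cpuid"
                       : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                       : "0" (leaf), "2" (0U) );
    }

    /* Returns non-zero if the CPU supports 1GB pages (PDPE1GB). */
    static int cpu_has_1gb_pages(void)
    {
        uint32_t eax, ebx, ecx, edx;

        cpuid_raw(0x80000000U, &eax, &ebx, &ecx, &edx);
        if ( eax < 0x80000001U )        /* extended leaf not implemented */
            return 0;

        cpuid_raw(0x80000001U, &eax, &ebx, &ecx, &edx);
        return !!(edx & (1U << 26));    /* EDX bit 26 = Page1GB */
    }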
Dan Magenheimer
2009-Mar-19 19:56 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Hi Wei --

I'm not worried about the overhead of the splintering itself; I'm worried about the "hidden overhead" every time a "silent splinter" is used.

Let's assume three scenarios (and for now use 2MB pages, though the same concerns extend to 1GB and/or mixed 2MB/1GB):

A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides only 2MB pages (no splintering occurs)
B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides only 4KB pages (because of fragmentation, all 2MB pages have been splintered)
C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides 4KB pages

Now run some benchmarks. Clearly one would assume that A is faster than both B and C. The question is: is B faster or slower than C?

If B is always faster than C, then I have less objection to "silent splintering". But if B is sometimes (or maybe always?) slower than C, that's a big issue, because a user has gone through the effort of choosing a better-performing system configuration for their software (2MB DB on 2MB OS), but it actually performs worse than if they had chosen the "lower performing" configuration. And, worse, it will likely degrade over time, so performance might be fine when the 2MB-DB-on-2MB-OS guest is launched but get much worse when it is paused, saved/restored, migrated, or hot-failed. So even if B is only slightly faster than C, if B is much slower than A, this is a problem.

Does that make sense?

Some suggestions:
1) If it is possible for an administrator to determine how many large pages (both 2MB and 1GB) were requested by each domain and how many are currently whole vs. splintered, that would help.
2) We may need some form of memory defragmenter.
Dan,

Regarding the concern about 2MBx4KB vs. 4KBx4KB (that is, scenario B vs. C), here is one paper that can partially answer it:

http://www.amd64.org/fileadmin/user_upload/pub/p26-bhargava.pdf

Figure 3 shows the distribution of references in a 2D (nested) page table walk. Compared with 4KBx4KB, 2MBx4KB eliminates the whole gL1 row, which contributes significantly to the total walk cost. The only concern is the degradation from TLB flushes (invlpg), because 512 4KB TLB entries are required to back one 2MB mapping, so the pressure on the CPU TLB is higher in this case. But our design team believes this is not very frequent.

I agree that the best case is to avoid splintering, so defragmentation is a good feature to have in the future, although it requires a lot of rework in the Xen MMU.

Regarding your first suggestion, it is actually easy to implement: we can put some statistical counters into the p2m code and print them out whenever needed.

-Wei

Dan Magenheimer wrote:
> Hi Wei --
>
> I'm not worried about the overhead of the splintering, I'm
> worried about the "hidden overhead" every time a "silent
> splinter" is used.
>
> Let's assume three scenarios (and for now use 2MB pages though
> the same concerns can be extended to 1GB and/or mixed 2MB/1GB):
>
> A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
>    only 2MB pages (no splintering occurs)
> B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides
>    only 4KB pages (because of fragmentation, all 2MB pages have
>    been splintered)
> C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides
>    4KB pages
>
> Now run some benchmarks. Clearly one would assume that A is
> faster than both B and C. The question is: Is B faster or slower
> than C?
>
> If B is always faster than C, then I have less objection to
> "silent splintering". But if B is sometimes (or maybe always?)
> slower than C, that's a big issue because a user has gone through
> the effort of choosing a better-performing system configuration
> for their software (2MB DB on 2MB OS), but it actually performs
> worse than if they had chosen the "lower performing" configuration.
> And, worse, it will likely degrade across time, so performance
> might be fine when the 2MB-DB-on-2MB-OS guest is launched
> but get much worse when it is paused, save/restored, migrated,
> or hot-failed. So even if B is only slightly faster than C,
> if B is much slower than A, this is a problem.
>
> Does that make sense?
>
> Some suggestions:
> 1) If it is possible for an administrator to determine how many
>    large pages (both 2MB and 1GB) were requested by each domain
>    and how many are currently whole-vs-splintered, that would help.
> 2) We may need some form of memory defragmenter
>
> [snip]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
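As a back-of-the-envelope illustration of the TLB-pressure point above (this is not code from the patches; it only multiplies out the architectural page sizes), the sketch below counts how many smaller mappings, and hence worst-case TLB entries, are needed to back one guest large page once the backing has been splintered:

#include <stdio.h>

int main(void)
{
    const unsigned long KB = 1024UL, MB = 1024UL * KB, GB = 1024UL * MB;

    /* Number of host-side mappings (and, in the worst case, TLB
     * entries) needed to back one guest large page after the p2m
     * has been splintered to a smaller page size. */
    printf("1 x 2MB guest page on 4KB p2m: %lu entries\n", 2 * MB / (4 * KB)); /* 512 */
    printf("1 x 1GB guest page on 2MB p2m: %lu entries\n", GB / (2 * MB));     /* 512 */
    printf("1 x 1GB guest page on 4KB p2m: %lu entries\n", GB / (4 * KB));     /* 262144 */
    return 0;
}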
George Dunlap
2009-Mar-20 09:45 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Dan,

Don't forget that this is about the p2m table, which is (if I understand correctly) orthogonal to what the guest pagetables are doing. So the scenario, if HAP is used, would be:

A) DB code uses 2MB pages, OS uses 2MB pages in the guest PTs, p2m uses 2MB pages
   - A TLB miss requires 3 * 3 = 9 reads (assuming a 64-bit guest)
B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4KB pages
   - A TLB miss requires 3 * 4 = 12 reads
C) DB code uses 4KB pages, OS uses 4KB pages, p2m uses 2MB pages
   - A TLB miss requires 4 * 3 = 12 reads
D) DB code uses 4KB pages, OS uses 4KB pages, p2m uses 4KB pages
   - A TLB miss requires 4 * 4 = 16 reads

And adding the 1GB p2m entries will change the multiplier from 3 to 2 (i.e., 3 * 2 = 6 reads for superpages, 4 * 2 = 8 reads for 4KB guest pages).

(Those who are more familiar with the hardware, please correct me if I've made some mistakes or oversimplified things.)

So adding 1GB pages to the p2m table shouldn't change expectations of the guest OS in any case. Using it will benefit the guest to the same degree whether the guest is using 4KB, 2MB, or 1GB pages. (If I understand correctly.)

-George

Dan Magenheimer wrote:
> Hi Wei --
>
> I'm not worried about the overhead of the splintering, I'm
> worried about the "hidden overhead" every time a "silent
> splinter" is used.
>
> [snip]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
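George's read counts can be reproduced with the simplified model he describes: under hardware-assisted paging, each TLB miss costs roughly (guest walk levels) * (p2m walk levels) memory reads. The sketch below only evaluates that product for the combinations discussed in the thread; it is an illustration of the arithmetic, not code from the patches, and it ignores the page-walk caches real hardware provides.

#include <stdio.h>

/* 64-bit paging: the leaf sits at L1 for 4KB pages, L2 for 2MB pages
 * and L3 for 1GB pages, giving 4, 3 or 2 levels to walk. */
static int levels(unsigned long page_size)
{
    if (page_size >= (1UL << 30)) return 2;   /* 1GB leaf */
    if (page_size >= (2UL << 20)) return 3;   /* 2MB leaf */
    return 4;                                 /* 4KB leaf */
}

int main(void)
{
    const unsigned long sizes[] = { 4UL << 10, 2UL << 20, 1UL << 30 };
    const char *names[] = { "4KB", "2MB", "1GB" };

    for (int g = 0; g < 3; g++)
        for (int p = 0; p < 3; p++)
            printf("guest %s / p2m %s: %d reads per TLB miss\n",
                   names[g], names[p],
                   levels(sizes[g]) * levels(sizes[p]));
    return 0;
}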
Dan Magenheimer
2009-Mar-20 13:40 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Interesting. And non-intuitive. I think you are saying that, at least theoretically (and using your ABCD, not my ABC below), A is always faster than (B | C), and (B | C) is always faster than D. Taking into account the fact that the TLB size is fixed (I think), C will always be faster than B and never slower than D.

So if the theory proves true, that does seem to eliminate my objection.

Thanks,
Dan

> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Friday, March 20, 2009 3:46 AM
> To: Dan Magenheimer
> Cc: Wei Huang; xen-devel@lists.xensource.com; Keir Fraser; Tim Deegan
> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>
> Dan,
>
> Don't forget that this is about the p2m table, which is (if I
> understand correctly) orthogonal to what the guest pagetables
> are doing.
>
> [snip]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dan,

I agree on the order: A > C >= B > D. Generally, super pages should perform better than small pages. In reality, the difference between B and C is subtle; it depends on how the TLB is designed and how frequently TLB flushes happen.

-Wei

Dan Magenheimer wrote:
> Interesting. And non-intuitive. I think you are saying that, at
> least theoretically (and using your ABCD, not my ABC below), A is
> always faster than (B | C), and (B | C) is always faster than D.
> Taking into account the fact that the TLB size is fixed (I think),
> C will always be faster than B and never slower than D.
>
> So if the theory proves true, that does seem to eliminate
> my objection.
>
> [snip]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Gianluca Guida
2009-Mar-20 13:59 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
On Fri, Mar 20, 2009 at 2:40 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> Interesting. And non-intuitive. I think you are saying that, at
> least theoretically (and using your ABCD, not my ABC below), A is
> always faster than (B | C), and (B | C) is always faster than D.
> Taking into account the fact that the TLB size is fixed (I think),
> C will always be faster than B and never slower than D.

Changing the p2m of course won't help in the shadow pagetables case. :/

The 'good' news is that I *think*, after recent measurements, that in current shadow pagetables a better utilization of the TLB entries (e.g. setting the PSE bits in the shadows, which is possible once we have the 2MB p2m entries) per se doesn't affect performance *at all*.

I will soon have a very small patch that would, if my current analysis is correct, improve performance in applications using 2MB pages without setting PSE in the shadows. I'll just need some people to help me test it on these kinds of applications.

Thanks,
Gianluca

--
It was a type of people I did not know, I found them very strange and
they did not inspire confidence at all. Later I learned that I had been
introduced to electronic engineers.
E. W. Dijkstra

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Gianluca Guida
2009-Mar-20 14:21 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
George Dunlap wrote:
> Gianluca Guida wrote:
>> Changing the p2m of course won't help in the shadow pagetables case. :/
>
> No, but it seems to me that 1G p2m entries can't hurt performance for
> shadows either, which is the question Dan was raising.

Yes, I am sure they won't. They'll make the p2m walk faster, which is good, but won't affect shadow performance in a significant way.

> It might be nice to have some quick "sanity check" benchmarks, just to
> make sure that adding the 1G pages won't hurt performance for the
> normal case...

Might do it, but I don't think it's necessary.

Thanks,
Gianluca

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
George Dunlap
2009-Mar-20 14:25 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Gianluca Guida wrote:
> Changing the p2m of course won't help in the shadow pagetables case. :/

No, but it seems to me that 1G p2m entries can't hurt performance for shadows either, which is the question Dan was raising.

It might be nice to have some quick "sanity check" benchmarks, just to make sure that adding the 1G pages won't hurt performance for the normal case...

-George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Gianluca Guida
2009-Mar-20 18:16 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Gianluca Guida wrote:
> George Dunlap wrote:
>> It might be nice to have some quick "sanity check" benchmarks, just to
>> make sure that adding the 1G pages won't hurt performance for the
>> normal case...
>
> Might do it, but I don't think it's necessary.

What I meant here is: checking that shadow performance hasn't decreased shouldn't be necessary. It would still be worthwhile, though, if we get the kind of improvements this patch shows on HAP. Any numbers?

Gianluca

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Huang2, Wei
2009-Mar-20 18:32 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
I am working on collecting the numbers. Dan, does Oracle have any database benchmarks?

-Wei

-----Original Message-----
From: Gianluca Guida [mailto:gianluca.guida@eu.citrix.com]
Sent: Friday, March 20, 2009 1:17 PM
To: George Dunlap
Cc: Gianluca Guida; Dan Magenheimer; Tim Deegan; Huang2, Wei; xen-devel@lists.xensource.com; Keir Fraser
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Gianluca Guida wrote:
> George Dunlap wrote:
>> It might be nice to have some quick "sanity check" benchmarks, just to
>> make sure that adding the 1G pages won't hurt performance for the
>> normal case...
>
> Might do it, but I don't think it's necessary.

What I meant here is: checking that shadow performance hasn't decreased shouldn't be necessary. It would still be worthwhile, though, if we get the kind of improvements this patch shows on HAP. Any numbers?

Gianluca

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2009-May-19 00:55 UTC
Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Now that 3.4 is out, if you want these patches checked in then please re-send with changeset comments and signed-off-by lines.

Thanks,
Keir

On 19/03/2009 08:07, "Huang2, Wei" <Wei.Huang2@amd.com> wrote:
> Here are patches using the middle approach. It handles 1GB pages in PoD
> by remapping 1GB with 2MB pages & retry. I also added code for 1GB
> detection. Please comment.
>
> Thanks a lot,
>
> -Wei
>
> [...]
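(Since the remap-and-retry fallback is the heart of the PoD changes, here is a minimal, self-contained sketch of the idea: if no page of the requested order is in the PoD cache, the entry is split into 512 smaller PoD entries and the faulting access is retried, so population falls back 1GB -> 2MB -> 4KB. This is not the patch code; pod_cache_get(), remap_as_pod() and p2m_t are hypothetical stand-ins.)

/*
 * Minimal sketch of the "remap and retry" fallback described above.
 * NOT the patch code; the types, helpers and constants are stand-ins.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ORDER_4K  0          /* 1   x 4KB page  */
#define ORDER_2M  9          /* 512 x 4KB pages */
#define ORDER_1G  18         /* 512 x 2MB pages */

typedef struct { int dummy; } p2m_t;
typedef uint64_t gfn_t;

/* Pretend cache lookup: succeeds only for 4KB requests in this sketch. */
static bool pod_cache_get(p2m_t *p2m, gfn_t gfn, unsigned int order)
{
    (void)p2m; (void)gfn;
    return order == ORDER_4K;
}

/* Pretend p2m update: remap the region as PoD entries of a smaller order. */
static void remap_as_pod(p2m_t *p2m, gfn_t gfn, unsigned int order,
                         unsigned int new_order)
{
    printf("remap gfn %#llx: order %u -> 512 entries of order %u\n",
           (unsigned long long)gfn, order, new_order);
    (void)p2m;
}

/*
 * Demand-populate a PoD entry of the given order.  If no page of that
 * order is available, split the entry into 512 smaller PoD entries and
 * report "retry": the faulting access then hits one of the smaller
 * entries, so population terminates at 4KB in the worst case.
 */
static bool pod_demand_populate(p2m_t *p2m, gfn_t gfn, unsigned int order)
{
    if ( pod_cache_get(p2m, gfn, order) )
        return true;                       /* populated at this order */

    if ( order == ORDER_1G )
        remap_as_pod(p2m, gfn, order, ORDER_2M);   /* 1GB -> 512 x 2MB */
    else if ( order == ORDER_2M )
        remap_as_pod(p2m, gfn, order, ORDER_4K);   /* 2MB -> 512 x 4KB */

    return false;                          /* caller retries the access */
}

int main(void)
{
    p2m_t p2m = { 0 };
    unsigned int order = ORDER_1G;

    /* A faulting access to a 1GB PoD entry falls back 1GB -> 2MB -> 4KB. */
    while ( !pod_demand_populate(&p2m, 0x40000, order) )
        order = (order == ORDER_1G) ? ORDER_2M : ORDER_4K;

    printf("populated at order %u\n", order);
    return 0;
}

Splitting instead of failing is what guarantees that a demand populate can always be satisfied, at worst with a single 4KB page.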
Xu, Dongxiao
2010-Jan-13 04:07 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Hi, Wei,

I dug this thread out of looooong history... Do you still plan to push your 1gb_p2m and 1gb_tools patches upstream? I rebased them to fit the latest upstream code (the 1gb_tools.patch is not changed). Any comment on that, or any new idea? Thanks!

Best Regards,
-- Dongxiao

Huang2, Wei wrote:
> Here are patches using the middle approach. It handles 1GB pages in PoD
> by remapping 1GB with 2MB pages & retry. I also added code for 1GB
> detection. Please comment.
>
> Thanks a lot,
>
> -Wei
>
> [...]
Dongxiao,

Thanks for rebasing the code. Does the new code work for you? I need to test the new code again on my machine and make sure it doesn't break. After that, we can ask Keir to push it upstream.

-Wei

Xu, Dongxiao wrote:
> Hi, Wei,
> I dug this thread out of looooong history... Do you still plan to push
> your 1gb_p2m and 1gb_tools patches upstream? I rebased them to fit the
> latest upstream code (the 1gb_tools.patch is not changed). Any comment
> on that, or any new idea? Thanks!
>
> Best Regards,
> -- Dongxiao
>
> [...]
Xu, Dongxiao
2010-Jan-14 01:36 UTC
RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Wei,

Yes, basically it works for me. However, I didn't test it very much; for example, I haven't tested the PoD functionality with the patch. Thanks!

Dongxiao

Wei Huang wrote:
> Dongxiao,
>
> Thanks for rebasing the code. Does the new code work for you? I need to
> test the new code again on my machine and make sure it doesn't break.
> After that, we can ask Keir to push it upstream.
>
> -Wei
>
> [...]