Hi, Stefano I found another bug in the balloon scratch page code. As I didn''t follow the discussion on scratch page so I cannot propose a proper fix at the moment. The problem is that in balloon.c:increase_reservation, when a ballooned page is resued, it can have a valid P2M entry pointing to the scratch, hitting the BUG_ON BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && phys_to_machine_mapping_valid(pfn)); As balloon worker might run by a CPU other then the one that returns the page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page mfns is not desirable. My thoughts so far: 1. remove that BUG_ON (looks like a wrong fix though) 2. make balloon scratch page global Other thoughts? Wei.
On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote:> Hi, Stefano > > I found another bug in the balloon scratch page code. As I didn''t follow > the discussion on scratch page so I cannot propose a proper fix at the > moment. > > The problem is that in balloon.c:increase_reservation, when a ballooned > page is resued, it can have a valid P2M entry pointing to the scratch, > hitting the BUG_ON > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > phys_to_machine_mapping_valid(pfn)); > > As balloon worker might run by a CPU other then the one that returns the > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > mfns is not desirable.This makes me think that whoever suggested that pfn_to_mfn for a ballooned page out to return INVALID_MFN was right. Even though we happen to have put a mapping there there pages are "logically empty" and that''s really what callers such as phys_to_machine_mapping_valid care about, not whether the processor thinks there is a mapping or not.> > My thoughts so far: > 1. remove that BUG_ON (looks like a wrong fix though) > 2. make balloon scratch page global > > Other thoughts? > > Wei. > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote:> On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > Hi, Stefano > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > the discussion on scratch page so I cannot propose a proper fix at the > > moment. > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > page is resued, it can have a valid P2M entry pointing to the scratch, > > hitting the BUG_ON > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > phys_to_machine_mapping_valid(pfn)); > > > > As balloon worker might run by a CPU other then the one that returns the > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > mfns is not desirable. > > This makes me think that whoever suggested that pfn_to_mfn for a > ballooned page out to return INVALID_MFN was right. >If there are many balloon pages the check can be expensive. If we make the scratch page globally shared by all CPUs the check can be less expensive? I don''t understand why we need to have one page per CPU at first glance. Wei.> Even though we happen to have put a mapping there there pages are > "logically empty" and that''s really what callers such as > phys_to_machine_mapping_valid care about, not whether the processor > thinks there is a mapping or not. > > > > > My thoughts so far: > > 1. remove that BUG_ON (looks like a wrong fix though) > > 2. make balloon scratch page global > > > > Other thoughts? > > > > Wei. > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > http://lists.xen.org/xen-devel >
On 02/09/13 16:04, Wei Liu wrote:> On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: >> On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: >>> Hi, Stefano >>> >>> I found another bug in the balloon scratch page code. As I didn''t follow >>> the discussion on scratch page so I cannot propose a proper fix at the >>> moment. >>> >>> The problem is that in balloon.c:increase_reservation, when a ballooned >>> page is resued, it can have a valid P2M entry pointing to the scratch, >>> hitting the BUG_ON >>> >>> BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && >>> phys_to_machine_mapping_valid(pfn)); >>> >>> As balloon worker might run by a CPU other then the one that returns the >>> page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t >>> work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page >>> mfns is not desirable. >> >> This makes me think that whoever suggested that pfn_to_mfn for a >> ballooned page out to return INVALID_MFN was right. >> > > If there are many balloon pages the check can be expensive. If we make > the scratch page globally shared by all CPUs the check can be less > expensive? I don''t understand why we need to have one page per CPU at > first glance.There needs to be a per-CPU mapping of the scratch pages as part of doing the unmap_and_replace with the scratch page, its mapping is cleared. It is later restored as a separate hypercall. David
On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote:> On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > Hi, Stefano > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > the discussion on scratch page so I cannot propose a proper fix at the > > > moment. > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > hitting the BUG_ON > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > phys_to_machine_mapping_valid(pfn)); > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > mfns is not desirable. > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > ballooned page out to return INVALID_MFN was right. > > > > If there are many balloon pages the check can be expensive.IIRC the suggestion was that the p2m for a ballooned out page would contain INVALID_MFN, so the expense is just the lookup you would be doing anyway.
On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote:> On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > Hi, Stefano > > > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > > the discussion on scratch page so I cannot propose a proper fix at the > > > > moment. > > > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > > hitting the BUG_ON > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > > mfns is not desirable. > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > ballooned page out to return INVALID_MFN was right. > > > > > > > If there are many balloon pages the check can be expensive. > > IIRC the suggestion was that the p2m for a ballooned out page would > contain INVALID_MFN, so the expense is just the lookup you would be > doing anyway. >That was David''s idea. Stefano was worried that other PVOPS hooks would need to know about the MFN. I don''t know how much it holds true. Need some input from Konrad I think. Wei.
On Mon, 2013-09-02 at 16:13 +0100, Wei Liu wrote:> On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote: > > On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > > Hi, Stefano > > > > > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > > > the discussion on scratch page so I cannot propose a proper fix at the > > > > > moment. > > > > > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > > > hitting the BUG_ON > > > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > > > mfns is not desirable. > > > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > > ballooned page out to return INVALID_MFN was right. > > > > > > > > > > If there are many balloon pages the check can be expensive. > > > > IIRC the suggestion was that the p2m for a ballooned out page would > > contain INVALID_MFN, so the expense is just the lookup you would be > > doing anyway. > > > > That was David''s idea. Stefano was worried that other PVOPS hooks would > need to know about the MFN. I don''t know how much it holds true. Need > some input from Konrad I think.Hrm, aren''t those expected to not be operating on ballooned out pages anyway? Prior to this change those callers would have got INVALID_MFN, I think? My gut feeling is that places which accidentally/unexpectedly have a ballooned out page in their hand have either a virtual address or an mfn (e.g. a cr2 fault address) and not a pfn in their hands, and won''t in general be that interested in the pfn and/or are already geared up to deal with INVALID_MFN because previously that''s what they would have gotten. There''s every chance I''m wrong though. I wonder how migration copes with all this cleverness BTW... Ian.
On Mon, 2 Sep 2013, Ian Campbell wrote:> On Mon, 2013-09-02 at 16:13 +0100, Wei Liu wrote: > > On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote: > > > On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > > > Hi, Stefano > > > > > > > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > > > > the discussion on scratch page so I cannot propose a proper fix at the > > > > > > moment. > > > > > > > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > > > > hitting the BUG_ON > > > > > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > > > > mfns is not desirable. > > > > > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > > > ballooned page out to return INVALID_MFN was right. > > > > > > > > > > > > > If there are many balloon pages the check can be expensive. > > > > > > IIRC the suggestion was that the p2m for a ballooned out page would > > > contain INVALID_MFN, so the expense is just the lookup you would be > > > doing anyway. > > > > > > > That was David''s idea. Stefano was worried that other PVOPS hooks would > > need to know about the MFN. I don''t know how much it holds true. Need > > some input from Konrad I think. > > Hrm, aren''t those expected to not be operating on ballooned out pages > anyway? Prior to this change those callers would have got INVALID_MFN, I > think? > > My gut feeling is that places which accidentally/unexpectedly have a > ballooned out page in their hand have either a virtual address or an mfn > (e.g. a cr2 fault address) and not a pfn in their hands, and won''t in > general be that interested in the pfn and/or are already geared up to > deal with INVALID_MFN because previously that''s what they would have > gotten. There''s every chance I''m wrong though.I think we should just remove the BUG_ON: - if the guest is autotranslate, then the BUG_ON is already irrelevant; - if the guest is not autotranslate, then we are sure that the p2m of the page is pointing to a scratch page; either way the BUG_ON is wrong. Otherwise could simply implement a is_balloon_scratch_page function that checks whether a given pfn corresponds to any of the scratch pages (it doesn''t need to be the scratch page of this cpu).
On Wed, Sep 04, 2013 at 02:15:51PM +0100, Stefano Stabellini wrote:> On Mon, 2 Sep 2013, Ian Campbell wrote: > > On Mon, 2013-09-02 at 16:13 +0100, Wei Liu wrote: > > > On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote: > > > > On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > > > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > > > > Hi, Stefano > > > > > > > > > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > > > > > the discussion on scratch page so I cannot propose a proper fix at the > > > > > > > moment. > > > > > > > > > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > > > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > > > > > hitting the BUG_ON > > > > > > > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > > > > > mfns is not desirable. > > > > > > > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > > > > ballooned page out to return INVALID_MFN was right. > > > > > > > > > > > > > > > > If there are many balloon pages the check can be expensive. > > > > > > > > IIRC the suggestion was that the p2m for a ballooned out page would > > > > contain INVALID_MFN, so the expense is just the lookup you would be > > > > doing anyway. > > > > > > > > > > That was David''s idea. Stefano was worried that other PVOPS hooks would > > > need to know about the MFN. I don''t know how much it holds true. Need > > > some input from Konrad I think. > > > > Hrm, aren''t those expected to not be operating on ballooned out pages > > anyway? Prior to this change those callers would have got INVALID_MFN, I > > think? > > > > My gut feeling is that places which accidentally/unexpectedly have a > > ballooned out page in their hand have either a virtual address or an mfn > > (e.g. a cr2 fault address) and not a pfn in their hands, and won''t in > > general be that interested in the pfn and/or are already geared up to > > deal with INVALID_MFN because previously that''s what they would have > > gotten. There''s every chance I''m wrong though. > > > I think we should just remove the BUG_ON: > > - if the guest is autotranslate, then the BUG_ON is already irrelevant; > - if the guest is not autotranslate, then we are sure that the p2m of > the page is pointing to a scratch page; > > either way the BUG_ON is wrong.Is it there to guard other misuse of the function? IOW can we be confident that all calls are legit? Messing up the P2M table looks very hard to debug?> > Otherwise could simply implement a is_balloon_scratch_page function that > checks whether a given pfn corresponds to any of the scratch pages (it > doesn''t need to be the scratch page of this cpu).That''s quite expensive IMHO, especially when you have lots of CPU''s and lots of ballooned pages. Wei.
On Wed, 4 Sep 2013, Wei Liu wrote:> On Wed, Sep 04, 2013 at 02:15:51PM +0100, Stefano Stabellini wrote: > > On Mon, 2 Sep 2013, Ian Campbell wrote: > > > On Mon, 2013-09-02 at 16:13 +0100, Wei Liu wrote: > > > > On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote: > > > > > On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > > > > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > > > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > > > > > Hi, Stefano > > > > > > > > > > > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > > > > > > the discussion on scratch page so I cannot propose a proper fix at the > > > > > > > > moment. > > > > > > > > > > > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > > > > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > > > > > > hitting the BUG_ON > > > > > > > > > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > > > > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > > > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > > > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > > > > > > mfns is not desirable. > > > > > > > > > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > > > > > ballooned page out to return INVALID_MFN was right. > > > > > > > > > > > > > > > > > > > If there are many balloon pages the check can be expensive. > > > > > > > > > > IIRC the suggestion was that the p2m for a ballooned out page would > > > > > contain INVALID_MFN, so the expense is just the lookup you would be > > > > > doing anyway. > > > > > > > > > > > > > That was David''s idea. Stefano was worried that other PVOPS hooks would > > > > need to know about the MFN. I don''t know how much it holds true. Need > > > > some input from Konrad I think. > > > > > > Hrm, aren''t those expected to not be operating on ballooned out pages > > > anyway? Prior to this change those callers would have got INVALID_MFN, I > > > think? > > > > > > My gut feeling is that places which accidentally/unexpectedly have a > > > ballooned out page in their hand have either a virtual address or an mfn > > > (e.g. a cr2 fault address) and not a pfn in their hands, and won''t in > > > general be that interested in the pfn and/or are already geared up to > > > deal with INVALID_MFN because previously that''s what they would have > > > gotten. There''s every chance I''m wrong though. > > > > > > I think we should just remove the BUG_ON: > > > > - if the guest is autotranslate, then the BUG_ON is already irrelevant; > > - if the guest is not autotranslate, then we are sure that the p2m of > > the page is pointing to a scratch page; > > > > either way the BUG_ON is wrong. > > Is it there to guard other misuse of the function? IOW can we be > confident that all calls are legit?the ballooned_pages list would need to be broken for this BUG_ON to trigger> Messing up the P2M table looks very hard to debug? > > > Otherwise could simply implement a is_balloon_scratch_page function that > > checks whether a given pfn corresponds to any of the scratch pages (it > > doesn''t need to be the scratch page of this cpu). > > That''s quite expensive IMHO, especially when you have lots of CPU''s and > lots of ballooned pages.as a compromise we could run a check on the pfn only on debug runs?
Konrad Rzeszutek Wilk
2013-Sep-04 15:14 UTC
Re: Balloon driver bug in increase_reservation
On Wed, Sep 04, 2013 at 02:20:11PM +0100, Wei Liu wrote:> On Wed, Sep 04, 2013 at 02:15:51PM +0100, Stefano Stabellini wrote: > > On Mon, 2 Sep 2013, Ian Campbell wrote: > > > On Mon, 2013-09-02 at 16:13 +0100, Wei Liu wrote: > > > > On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote: > > > > > On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > > > > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > > > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > > > > > Hi, Stefano > > > > > > > > > > > > > > > > I found another bug in the balloon scratch page code. As I didn''t follow > > > > > > > > the discussion on scratch page so I cannot propose a proper fix at the > > > > > > > > moment. > > > > > > > > > > > > > > > > The problem is that in balloon.c:increase_reservation, when a ballooned > > > > > > > > page is resued, it can have a valid P2M entry pointing to the scratch, > > > > > > > > hitting the BUG_ON > > > > > > > > > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > > > > > > > > > As balloon worker might run by a CPU other then the one that returns the > > > > > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn wouldn''t > > > > > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all scratch page > > > > > > > > mfns is not desirable. > > > > > > > > > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > > > > > ballooned page out to return INVALID_MFN was right. > > > > > > > > > > > > > > > > > > > If there are many balloon pages the check can be expensive. > > > > > > > > > > IIRC the suggestion was that the p2m for a ballooned out page would > > > > > contain INVALID_MFN, so the expense is just the lookup you would be > > > > > doing anyway. > > > > > > > > > > > > > That was David''s idea. Stefano was worried that other PVOPS hooks would > > > > need to know about the MFN. I don''t know how much it holds true. Need > > > > some input from Konrad I think. > > > > > > Hrm, aren''t those expected to not be operating on ballooned out pages > > > anyway? Prior to this change those callers would have got INVALID_MFN, I > > > think? > > > > > > My gut feeling is that places which accidentally/unexpectedly have a > > > ballooned out page in their hand have either a virtual address or an mfn > > > (e.g. a cr2 fault address) and not a pfn in their hands, and won''t in > > > general be that interested in the pfn and/or are already geared up to > > > deal with INVALID_MFN because previously that''s what they would have > > > gotten. There''s every chance I''m wrong though. > > > > > > I think we should just remove the BUG_ON: > > > > - if the guest is autotranslate, then the BUG_ON is already irrelevant; > > - if the guest is not autotranslate, then we are sure that the p2m of > > the page is pointing to a scratch page; > > > > either way the BUG_ON is wrong. > > Is it there to guard other misuse of the function? IOW can we be > confident that all calls are legit? > > Messing up the P2M table looks very hard to debug? > > > > > Otherwise could simply implement a is_balloon_scratch_page function that > > checks whether a given pfn corresponds to any of the scratch pages (it > > doesn''t need to be the scratch page of this cpu). > > That''s quite expensive IMHO, especially when you have lots of CPU''s and > lots of ballooned pages.Fortunatly you don''t have to take lock. The PFNs for the scratch pages are set in stone for each vCPU and don''t change (unless the CPU goes down, but then the ''for_each_online_cpu'' would omit said CPU). And I think the balloon driver does everything from one workqueue so the check can done there?> > Wei. > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On Wed, Sep 04, 2013 at 11:14:30AM -0400, Konrad Rzeszutek Wilk wrote: [...]> > > > > > Otherwise could simply implement a is_balloon_scratch_page function that > > > checks whether a given pfn corresponds to any of the scratch pages (it > > > doesn''t need to be the scratch page of this cpu). > > > > That''s quite expensive IMHO, especially when you have lots of CPU''s and > > lots of ballooned pages. > > Fortunatly you don''t have to take lock. The PFNs for the scratch pages are > set in stone for each vCPU and don''t change (unless the CPU goes down, but > then the ''for_each_online_cpu'' would omit said CPU). > > And I think the balloon driver does everything from one workqueue so > the check can done there?Well, what are the chances that you have 256 CPUs and then need to balloon 2K pages (only 8MB)... andd what''s the frequency you need to do that... Maybe I''m just paranoid to imagine all those extreme use cases. Wei.> > > > Wei. > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > http://lists.xen.org/xen-devel
On Wed, Sep 04, 2013 at 02:31:13PM +0100, Stefano Stabellini wrote: [...]> > > > > > - if the guest is autotranslate, then the BUG_ON is already irrelevant; > > > - if the guest is not autotranslate, then we are sure that the p2m of > > > the page is pointing to a scratch page; > > > > > > either way the BUG_ON is wrong. > > > > Is it there to guard other misuse of the function? IOW can we be > > confident that all calls are legit? > > the ballooned_pages list would need to be broken for this BUG_ON to > trigger >Right.> > > Messing up the P2M table looks very hard to debug? > > > > > Otherwise could simply implement a is_balloon_scratch_page function that > > > checks whether a given pfn corresponds to any of the scratch pages (it > > > doesn''t need to be the scratch page of this cpu). > > > > That''s quite expensive IMHO, especially when you have lots of CPU''s and > > lots of ballooned pages. > > as a compromise we could run a check on the pfn only on debug runs?Actually I''m out of idea now so I''m happy to implement a solution that everybody agrees... Wei.
Konrad Rzeszutek Wilk
2013-Sep-04 16:35 UTC
Re: Balloon driver bug in increase_reservation
On Wed, Sep 04, 2013 at 04:42:16PM +0100, Wei Liu wrote:> On Wed, Sep 04, 2013 at 11:14:30AM -0400, Konrad Rzeszutek Wilk wrote: > [...] > > > > > > > > Otherwise could simply implement a is_balloon_scratch_page function that > > > > checks whether a given pfn corresponds to any of the scratch pages (it > > > > doesn''t need to be the scratch page of this cpu). > > > > > > That''s quite expensive IMHO, especially when you have lots of CPU''s and > > > lots of ballooned pages. > > > > Fortunatly you don''t have to take lock. The PFNs for the scratch pages are > > set in stone for each vCPU and don''t change (unless the CPU goes down, but > > then the ''for_each_online_cpu'' would omit said CPU). > > > > And I think the balloon driver does everything from one workqueue so > > the check can done there? > > Well, what are the chances that you have 256 CPUs and then need to > balloon 2K pages (only 8MB)... andd what''s the frequency you need to do > that... Maybe I''m just paranoid to imagine all those extreme use cases.Fortunatly it is a slow process. It does not have to happen immediately so we can also take a lock if need to. Just as long as the lock is not taken in the M2P or P2M code I think we are fine.> > Wei. > > > > > > > Wei. > > > > > > _______________________________________________ > > > Xen-devel mailing list > > > Xen-devel@lists.xen.org > > > http://lists.xen.org/xen-devel
On Wed, Sep 04, 2013 at 12:35:27PM -0400, Konrad Rzeszutek Wilk wrote:> On Wed, Sep 04, 2013 at 04:42:16PM +0100, Wei Liu wrote: > > On Wed, Sep 04, 2013 at 11:14:30AM -0400, Konrad Rzeszutek Wilk wrote: > > [...] > > > > > > > > > > Otherwise could simply implement a is_balloon_scratch_page function that > > > > > checks whether a given pfn corresponds to any of the scratch pages (it > > > > > doesn''t need to be the scratch page of this cpu). > > > > > > > > That''s quite expensive IMHO, especially when you have lots of CPU''s and > > > > lots of ballooned pages. > > > > > > Fortunatly you don''t have to take lock. The PFNs for the scratch pages are > > > set in stone for each vCPU and don''t change (unless the CPU goes down, but > > > then the ''for_each_online_cpu'' would omit said CPU). > > > > > > And I think the balloon driver does everything from one workqueue so > > > the check can done there? > > > > Well, what are the chances that you have 256 CPUs and then need to > > balloon 2K pages (only 8MB)... andd what''s the frequency you need to do > > that... Maybe I''m just paranoid to imagine all those extreme use cases. > > Fortunatly it is a slow process. It does not have to happen immediately so > we can also take a lock if need to. Just as long as the lock is not taken > in the M2P or P2M code I think we are fine. >There is already a mutex, I think that''s good enough. I will spin a patch with this check. Wei.> > > > Wei. > > > > > > > > > > Wei. > > > > > > > > _______________________________________________ > > > > Xen-devel mailing list > > > > Xen-devel@lists.xen.org > > > > http://lists.xen.org/xen-devel