Hi all,

Recently we hit a very strange memory problem in dom0. The test case is very simple; the code is as follows:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE 4096
    #define IO_PATTERN 0xab

    int main(int argc, char *argv[])
    {
        void *buf;
        char cmp_buf[BUF_SIZE];

        buf = malloc(BUF_SIZE);
        if (!buf) {
            fprintf(stderr, "error %s during %s\n",
                    strerror(errno),
                    "malloc");
            return 1;
        }
        /* Fill the heap buffer and the stack buffer with the same pattern. */
        memset(buf, IO_PATTERN, BUF_SIZE);
        memset(cmp_buf, IO_PATTERN, BUF_SIZE);

        if (memcmp(buf, cmp_buf, BUF_SIZE)) {
            unsigned long long *ubuf = (unsigned long long *)buf;
            int i;

            /* On a mismatch, dump buf as 64-bit words. */
            for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
                printf("%d: 0x%llx\n", i, ubuf[i]);

            return 2;
        }

        return 0;
    }

The memcmp failure occurs when the test is run on 500 machines with Xen, a billion times on each.
The error logs show two kinds of result: in one, the dump is all 0x0, which shows that buf is zero; in the other, the dump is 0xabababa...ababa, which shows that it is cmp_buf that isn't 0xabab..ab.

In both cases, either buf or cmp_buf is entirely incorrect.

However, the test passes when we run it on a native Linux kernel (2.6.32) without Xen.

We suspect it may be related to the pvops behavior of dom0.

We were not sure whether this was a bug already fixed in newer versions of the kernel and Xen, so we tried different versions of Xen and dom0, including Xen 4.0.1 + kernel 2.6.32/3.0/3.11 and Xen 4.2 + kernel 2.6.32. Unfortunately, all of these failed.

We found that PAT behaves differently between Linux and Xen, so we tried adding nopat to the command line of kernel 3.11, and it also failed.

Now we are blocked and really need some help.

Any advice will be appreciated.

Thanks in advance.

Regards,
wanjia

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
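A note on the test case above: on a mismatch it dumps only buf, so the log has to be interpreted to work out which of the two buffers went wrong. A minimal sketch of a variant that checks each buffer against the pattern independently is below. The helper names first_bad_byte and check_once and the report format are made up for illustration and are not part of the original test.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 4096
#define IO_PATTERN 0xab

/* Hypothetical helper: offset of the first byte that differs from
 * pattern, or -1 if every byte matches. */
long first_bad_byte(const unsigned char *p, size_t len, unsigned char pattern)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (p[i] != pattern)
            return (long)i;
    return -1;
}

/* One iteration of the test: returns 0 if both buffers hold the
 * pattern, 1 if malloc failed, 2 if either buffer is wrong. */
int check_once(void)
{
    unsigned char cmp_buf[BUF_SIZE];
    unsigned char *buf = malloc(BUF_SIZE);
    long bad_heap, bad_stack;

    if (!buf) {
        perror("malloc");
        return 1;
    }

    memset(buf, IO_PATTERN, BUF_SIZE);
    memset(cmp_buf, IO_PATTERN, BUF_SIZE);

    /* Check each buffer on its own instead of against the other. */
    bad_heap = first_bad_byte(buf, BUF_SIZE, IO_PATTERN);
    bad_stack = first_bad_byte(cmp_buf, BUF_SIZE, IO_PATTERN);

    if (bad_heap >= 0)
        printf("buf (heap) wrong at offset %ld: 0x%02x\n",
               bad_heap, buf[bad_heap]);
    if (bad_stack >= 0)
        printf("cmp_buf (stack) wrong at offset %ld: 0x%02x\n",
               bad_stack, cmp_buf[bad_stack]);

    free(buf);
    return (bad_heap >= 0 || bad_stack >= 0) ? 2 : 0;
}

#ifdef TEST_STANDALONE
int main(void)
{
    return check_once();
}
#endif
```

Building with -DTEST_STANDALONE gives a program with the same exit codes as the original test; the byte-wise check is slower than memcmp, which may matter when the test is repeated a billion times.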
On 22/10/13 16:41, Alice Wan wrote:
> recently we met an unbelievable weird memory problem running on
> dom0, test case is very simple, code is as following:
> [...]

Picking randomly at some ideas:

Do you have ballooning enabled?

At the time of a failure, is there anything interesting in the Linux or Xen dmesg?

Are you running a debug version of Linux or Xen?

~Andrew
Well, no ballooning; the command line has dom0_mem.

There is nothing useful in dmesg or xm dmesg.

The kernel is not configured with DEBUG, and maybe if we enable DEBUG the problem can no longer be reproduced.

Does anyone have any ideas about pte_flags?

Some info about the MTRRs:

reg00: base=0x0ffc00000 ( 4092MB), size=    4MB, count=1: write-protect
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size=  512MB, count=1: uncachable
reg03: base=0x0e0000000 ( 3584MB), size=  256MB, count=1: uncachable
reg04: base=0x0f0000000 ( 3840MB), size=  128MB, count=1: uncachable
reg05: base=0x0f8000000 ( 3968MB), size=   64MB, count=1: uncachable
reg06: base=0x0fc000000 ( 4032MB), size=   32MB, count=1: uncachable
reg07: base=0x0fec00000 ( 4076MB), size=    4MB, count=1: uncachable

Regards,
wanjia

2013/10/23 Andrew Cooper <andrew.cooper3@citrix.com>
> Picking randomly at some ideas:
>
> Do you have ballooning enabled?
>
> At the time of a failure, is there anything interesting in the Linux or
> Xen dmesg?
>
> Are you running a debug version of Linux or Xen?
>
> ~Andrew
On Wed, 2013-10-23 at 12:36 +0800, Alice Wan wrote:
> kernel haven't config DEBUG
>
> and maybe if we open DEBUG, this problem can't be reproduced.

I think it would be worth trying it in order to confirm or deny this rather than just supposing it might cause problems.

Ian.
On Wed, Oct 23, 2013 at 5:36 AM, Alice Wan <wanjia19870902@gmail.com> wrote:
> well, no balloon, command line has dom0_mem
>
> no any useful dmesg, xm dmesg
>
> kernel haven't config DEBUG
>
> and maybe if we open DEBUG, this problem can't be reproduced.
>
> has any ideas about pte_flags ?
> [...]

Any updates on this?

-George
Yeah, in the end we found that it is specific to the glibc memset implementation, which is optimized with SSE instructions.

The detailed reason is given here:
http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg00600.html

Thank you all for the advice.

Regards,
wanjia

2013/10/31 George Dunlap <George.Dunlap@eu.citrix.com>
> Any updates on this?
>
> -George
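For anyone who wants to check whether the libc memset is involved on their own setup, one option is to compare it against a plain byte-by-byte fill that the compiler should not turn into a memset call. This is only a sketch under that assumption; it does not reproduce the analysis in the linked message, and the names plain_fill and fills_agree are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise fill through a volatile pointer, so the compiler keeps
 * the stores as a simple loop instead of substituting a call to the
 * library memset. */
void plain_fill(void *dst, unsigned char pattern, size_t len)
{
    volatile unsigned char *p = dst;
    size_t i;

    for (i = 0; i < len; i++)
        p[i] = pattern;
}

/* Fill a with the library memset and b with the plain loop, then
 * compare them byte by byte. Returns 1 if they match, 0 otherwise. */
int fills_agree(unsigned char *a, unsigned char *b, size_t len,
                unsigned char pattern)
{
    size_t i;

    memset(a, pattern, len);
    plain_fill(b, pattern, len);

    for (i = 0; i < len; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

If the original test only ever fails when memset does the filling and never when plain_fill is substituted for both calls, that points at the optimized code path rather than at the memory itself.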