Ian Pratt
2006-Jun-15 07:06 UTC
[Xen-tools] RE: [Xen-devel] Hi, something about the xentrace tool
> If overflow occurs, it is not handled. The mechanism I implemented was
> just designed to drastically reduce the probability of overflow.

It does count the number of lost trace messages and add a trace message
to that effect though, right?

Thanks,
Ian

> Currently, the trace buffer "high water" mark is set to 50%. That is,
> when the hypervisor trace buffer becomes 1/2 full, it sends a soft
> interrupt to wake up xenbaked from its blocking select(). If nobody
> wakes up to read trace records from the trace buffer, I take that to
> mean that nobody cares about the trace records. When somebody does care,
> they will read those records in a timely manner. Obviously, the
> hypervisor cannot "block" if there is no room in the trace buffers; in
> this case, new trace records simply overwrite old ones, and the old ones
> are lost.
>
> If you encounter a situation where trace records are being generated too
> fast, and fill up the trace buffer too quickly, then the simple next
> step is to increase the size of the trace buffers. So far, use of the
> trace records has not been linked to anything so critical that it's
> necessary to take extraordinary measures to avoid loss of data.
>
> Rob
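For illustration, here is a minimal sketch of the high-water-mark scheme Rob describes: the producer checks how full the ring is after adding a record and notifies the consumer once it crosses half full. The ring layout and notify_consumer() are assumptions made for this sketch, not the actual Xen trace-buffer code (see xen/common/trace.c for the real thing).

/* Minimal sketch of the "high water mark" notification described above.
 * The ring layout and notify_consumer() are hypothetical. */
#include <stdint.h>

struct trace_ring {
    uint32_t prod;      /* next slot the producer will write        */
    uint32_t cons;      /* next slot the consumer will read         */
    uint32_t size;      /* total number of record slots in the ring */
};

static void notify_consumer(void)
{
    /* In Xen this would be the soft interrupt that wakes xenbaked out
     * of its blocking select(); here it is just a stand-in. */
}

static void ring_put(struct trace_ring *r)
{
    uint32_t used;

    /* Old records are simply overwritten on overflow. */
    r->prod = (r->prod + 1) % r->size;
    used = (r->prod - r->cons + r->size) % r->size;

    /* Wake the consumer once the ring is half full. */
    if ( used >= r->size / 2 )
        notify_consumer();
}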
rickey berkeley
2006-Jun-15 08:58 UTC
[Xen-users] Re: [Xen-devel] Hi, something about the xentrace tool
> > If you encounter a situation where trace records are being generated too
> > fast, and fill up the trace buffer too quickly, then the simple next
> > step is to increase the size of the trace buffers. So far, use of the
> > trace records has not been linked to anything so critical that it's
> > necessary to take extraordinary measures to avoid loss of data.
> >
> > Rob

Hi Rob,

Since xentrace can be used as a performance tracing and debugging tool:
you mean that when transferring large amounts of data from kernel space
to user space, xentrace uses its own mechanism to relay the data and
balance the transfer speed, and we can enlarge the buffer size if we
want to save more raw trace data?

Does this mechanism noticeably affect system performance? As we know,
copying huge amounts of raw data from kernel space to user space can
consume a lot of CPU time and system resources.

How about making use of relayfs? It is a standardized way of
transferring large amounts of data from kernel space to user space.

Anyway, it is just an idea.
Ian Pratt wrote:
>> If overflow occurs, it is not handled. The mechanism I implemented was
>> just designed to drastically reduce the probability of overflow.
>
> It does count the number of lost trace messages and add a trace message
> to that effect though, right?

No, but I'll add that to the list of things to do in the future.

Rob
rickey berkeley wrote:
> Does this mechanism noticeably affect system performance? As we know,
> copying huge amounts of raw data from kernel space to user space can
> consume a lot of CPU time and system resources.

I wouldn't call the amount of data "huge". Even on a very busy system,
where there are thousands of trace records being generated every second,
that's still a pretty small amount of data. (The size of a trace record
is something like 50 or 60 bytes.)

Also, the data is not "copied" from kernel space to user space. There is
a shared memory buffer which Xen writes into, and the user app reads out
of. Memory read speeds are currently in the GB/s range. So to answer
your question, I don't think that this mechanism affects system
performance in any significant way.

> How about making use of relayfs? It is a standardized way of
> transferring large amounts of data from kernel space to user space.

If the data were only being transferred between the Linux kernel and a
Linux app, then I'd say yeah, relayfs sounds like a cool thing to do.
However, the trace records are generated by the Xen hypervisor, not the
Linux kernel. The hypervisor doesn't have relayfs (or any fs, for that
matter), so you're stuck with involving the Linux kernel, which would
read stuff from a shared hypervisor buffer and then present the data to
userland via relayfs. That doesn't sound like a better solution than
what we have now.

Rob
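To make the shared-buffer point concrete, here is a rough sketch of what a user-space consumer loop might look like once the trace pages have been mapped into its address space. The ring header, record layout, and field names are assumptions for illustration only; the point is that no data is copied between kernel and user space, the consumer simply reads the mapped memory.

/* Rough sketch of reading records out of a shared-memory trace ring.
 * The ring and record layouts are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

struct rec {                 /* hypothetical trace record */
    uint64_t tsc;
    uint32_t event;
    uint32_t data[5];
};

struct ring {                /* hypothetical ring header  */
    volatile uint32_t prod;
    volatile uint32_t cons;
    uint32_t size;
    struct rec recs[];
};

static void consume(struct ring *r)
{
    /* prod and cons are assumed to be free-running counters; the slot
     * index is taken modulo the ring size. */
    while ( r->cons != r->prod )
    {
        struct rec *rec = &r->recs[r->cons % r->size];
        printf("event %#x at tsc %llu\n",
               rec->event, (unsigned long long)rec->tsc);
        r->cons++;           /* nothing is copied back to the HV */
    }
}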
On 6/15/06, Rob Gardner <rob.gardner@hp.com> wrote:
> I wouldn't call the amount of data "huge". Even on a very busy system,
> where there are thousands of trace records being generated every second,
> that's still a pretty small amount of data. (The size of a trace record
> is something like 50 or 60 bytes.)

For the record, I think the trace record size in the trace buffers is
probably 32 bytes:

struct {
    unsigned long long rdtsc;   /* 8            */
    unsigned long event;        /* + 4 = 12     */
    unsigned long data[5];      /* + (4 * 5) = 32 */
};

The size on disk from xentrace is 36 bytes (it adds 4 bytes for the cpu).

If someone were really worried about copy time, one could write
something which uses raw disks (or, perhaps, the O_DIRECT flag) to DMA
data straight from the buffers to the disk. But I'm not really worried
about it at this point. :-)

Peace,
-George
On 6/15/06, George Dunlap <dunlapg@umich.edu> wrote:
> For the record, I think the trace record size in the trace buffers is
> probably 32 bytes:

On 32-bit architectures, that is... (Sorry for the 32-bit
provincialism... haven't coded on a 64-bit box yet.)

-G
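The difference comes from unsigned long being 4 bytes on 32-bit x86 but 8 bytes on LP64 targets. A quick check using the layout quoted above; the sizes in the comment assume typical ILP32/LP64 ABIs:

/* Quick check of the record size on the current architecture; the
 * layout mirrors the struct quoted earlier in the thread. */
#include <stdio.h>

struct t_rec_like {
    unsigned long long rdtsc;
    unsigned long event;
    unsigned long data[5];
};

int main(void)
{
    /* 32 bytes on ILP32, 56 bytes on LP64 (8 + 8 + 5*8). */
    printf("sizeof(struct t_rec_like) = %zu\n", sizeof(struct t_rec_like));
    return 0;
}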
George Dunlap wrote:
> On 6/15/06, Rob Gardner <rob.gardner@hp.com> wrote:
>> I wouldn't call the amount of data "huge". Even on a very busy system,
>> where there are thousands of trace records being generated every second,
>> that's still a pretty small amount of data. (The size of a trace record
>> is something like 50 or 60 bytes.)
>
> For the record, I think the trace record size in the trace buffers is
> probably 32 bytes:

You're right, I was thinking everything is 64 bits these days. In any
case, it's a small amount of data.

> If someone were really worried about copy time, one could write
> something which uses raw disks (or, perhaps, the O_DIRECT flag) to DMA
> data straight from the buffers to the disk.

Once again, there is no explicit copying of the data between kernel and
user space, so nobody should be worried about it.

Rob
On 6/15/06, Rob Gardner <rob.gardner@hp.com> wrote:
> > If someone were really worried about copy time, one could write
> > something which uses raw disks (or, perhaps, the O_DIRECT flag) to DMA
> > data straight from the buffers to the disk.
>
> Once again, there is no explicit copying of the data between kernel and
> user space, so nobody should be worried about it.

There's no copying from the HV to the xentrace process. But there is
copying from xentrace to the dom0 kernel for the output file. Some
copying is necessary right now, because rather than writing out the
pages verbatim, xentrace writes out the pcpu before writing out each
record:

void write_rec(unsigned int cpu, struct t_rec *rec, FILE *out)
{
    size_t written = 0;
    written += fwrite(&cpu, sizeof(cpu), 1, out);
    written += fwrite(rec, sizeof(*rec), 1, out);
    if ( written != 2 )
    {
        PERROR("Failed to write trace record");
        exit(EXIT_FAILURE);
    }
}

If we wanted to make it zero copy all the way from the HV to the disk,
we could have the xentrace process one stream per cpu, and do
whatever's necessary to use DMA. (Does anyone know if O_DIRECT will
do direct DMA, or if one would have to use a raw disk?)

But I think we all seem to agree, this is not a high priority. :-)

-George
George Dunlap wrote:
> There's no copying from the HV to the xentrace process. But there is
> copying from xentrace to the dom0 kernel for the output file. Some
> copying is necessary right now, because rather than writing out the
> pages verbatim, xentrace writes out the pcpu before writing out each
> record:
>
> void write_rec(unsigned int cpu, struct t_rec *rec, FILE *out)
> {
>     size_t written = 0;
>     written += fwrite(&cpu, sizeof(cpu), 1, out);
>     written += fwrite(rec, sizeof(*rec), 1, out);
>     if ( written != 2 )
>     {
>         PERROR("Failed to write trace record");
>         exit(EXIT_FAILURE);
>     }
> }
>
> If we wanted to make it zero copy all the way from the HV to the disk,
> we could have the xentrace process one stream per cpu, and do
> whatever's necessary to use DMA. (Does anyone know if O_DIRECT will
> do direct DMA, or if one would have to use a raw disk?)

So you're saying that if we didn't have to write the cpu number, then we
could bypass stdio and directly do a write() using the trace buffer? And
this would be better because it would avoid a memory to memory copy, and
use DMA immediately on the trace buffer memory? Do I understand you
correctly?

Assuming this is what you mean, allow me to correct a slight logic flaw.
Stdio is there for a reason; doing lots of raw I/O using very small
buffers is highly inefficient. There's the overhead of kernel entry/exit
and of setting up and tearing down DMA transactions. And writing to a
block device will result in I/Os that are multiples of the device's
block size, so writing a 32-byte trace record will probably cause a
512-byte block to actually be written to disk. So bypassing stdio in
this case will result in lots more disk accesses, lots more DMA
setup/teardown, and lots more system calls. In other words, the
performance is going to be horrible.

The stdio library greatly reduces all this overhead by buffering stuff
in memory until there's enough to make a genuine I/O relatively
efficient. In this case, the memory copies are intentional and
beneficial; we do not want to eliminate them in our quest for "zero
copy".

Rob
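As an aside, if larger batching were ever wanted without giving up the stdio interface, the stdio buffer itself can simply be enlarged with setvbuf(3). This is purely an illustration of the buffering point above; nothing in this thread says xentrace actually does this, and the file name and buffer size are arbitrary:

/* Illustration of the stdio buffering point: keep the small per-record
 * fwrite() calls, but make the underlying buffer large so that actual
 * write() syscalls happen in big, efficient chunks. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *out = fopen("trace.out", "wb");
    if ( out == NULL )
    {
        perror("fopen");
        return EXIT_FAILURE;
    }

    /* 1 MB of user-space buffering before any write() is issued. */
    if ( setvbuf(out, NULL, _IOFBF, 1 << 20) != 0 )
        perror("setvbuf");

    /* ... many small fwrite() calls of ~32-36 byte records here ... */

    fclose(out);
    return EXIT_SUCCESS;
}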
On 6/19/06, Rob Gardner <rob.gardner@hp.com> wrote:
> Stdio is there for a reason; doing lots of raw I/O using
> very small buffers is highly inefficient.

You misunderstand me. :-) I meant to write out (via DMA) several pages
at a time, straight from the HV trace buffers. The default tbuf size in
xentrace is 20 pages, so if (as the plan is) xentrace were notified when
the buffer is half full, we could easily write out 10 pages in one
transaction. The tbuf size could be increased if DMA setup/teardown
overhead were an issue at that scale.

You're right, for traces that fit in the file cache, buffering is a big
win. The copy overhead is negligible, writes to disk are more efficient,
and the data will be in the file cache for reading during subsequent
analysis. But for traces that won't fit in the file cache, the best
thing would be to get them to disk with as little copying and
cache-trashing as possible. Some of my recent traces have been on the
order of 10 gigabytes.

I haven't done much to modify xentrace, because I'm not worried about
the trace overhead at this point. But I've had to pull some tricks to
get my analysis tools to run in anything like a reasonable amount of
time.

-George
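For the traces-that-don't-fit-in-the-file-cache case George describes, the usual recipe is O_DIRECT with an aligned buffer and page-multiple transfer sizes. A hedged sketch follows; the alignment and size constants are assumptions, the data here is dummy rather than real trace pages, and actual O_DIRECT requirements depend on the kernel and filesystem:

/* Sketch of writing several trace pages in one O_DIRECT transaction,
 * bypassing the buffer cache. */
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define NPAGES    10            /* half of a 20-page trace buffer */

int main(void)
{
    void *buf;
    int fd = open("trace.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if ( fd < 0 ) { perror("open"); return EXIT_FAILURE; }

    /* O_DIRECT needs an aligned buffer and an aligned transfer size. */
    if ( posix_memalign(&buf, PAGE_SIZE, NPAGES * PAGE_SIZE) != 0 )
    {
        perror("posix_memalign");
        return EXIT_FAILURE;
    }

    /* In reality this would be the mapped trace-buffer pages; here we
     * just fill the buffer with dummy data. */
    memset(buf, 0, NPAGES * PAGE_SIZE);

    if ( write(fd, buf, NPAGES * PAGE_SIZE) != NPAGES * PAGE_SIZE )
        perror("write");

    free(buf);
    close(fd);
    return EXIT_SUCCESS;
}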
George Dunlap wrote:
> You misunderstand me. :-) I meant to write out (via DMA) several
> pages at a time, straight from the HV trace buffers. The default tbuf
> size in xentrace is 20 pages, so if (as the plan is) xentrace were
> notified when the buffer is half full, we could easily write out 10
> pages in one transaction. The tbuf size could be increased if DMA
> setup/teardown overhead were an issue at that scale.
> ...
> Some of my recent traces have been on the order of 10 gigabytes. I
> haven't done much to modify xentrace, because I'm not worried about
> the trace overhead at this point. But I've had to pull some tricks to
> get my analysis tools to run in anything like a reasonable amount of
> time.

I am glad to discover that I misunderstood you. ;) But I am still having
trouble understanding what the actual problem is, or even if one exists.
If you have a trace that is 10 gigabytes, that's several days (maybe
weeks) worth of trace records, depending on the rate at which they're
generated. A memory to memory copy of 10 gigabytes will take mere
seconds on any modern machine, and amortized over a few days, I don't
see how it's worth any work to further reduce or eliminate it. Is the
system so cpu-bound that the loss of a few seconds over several days is
that serious? Even compared to the disk I/O to write out 10 GB, which is
probably several minutes, I don't see how the memory copies are a big
deal. Perhaps kernel buffer cache effects are noticeable, but again at
the data rate you're talking about, the cache will only get completely
purged once every 5 or 10 hours.

If your analysis tools take a long time to run, I'd guess it's because
of the size of the data, not because system resources are being hogged
by xentrace. If you are generating that much data, maybe you should
consider methods to reduce it. Take a look at the trace code
(xen/common/trace.c) and you'll see that there is a facility to mask out
tracing of certain events, classes of events, and cpus. You might use
this to drastically reduce the number of trace records generated. For
instance, if you are not interested in tracing memory-related events,
you don't want to be storing TRC_MEM records, which account for a large
percentage of the trace records generated on a busy system.

Rob
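A sketch of the kind of filtering Rob is pointing at: a single mask test on the fast path drops unwanted records before they are ever written into the buffer. The variable name, class encoding, and values below are illustrative assumptions; the real masking logic lives in xen/common/trace.c and the public trace headers:

/* Illustration of event-class masking in a trace fast path.  The mask
 * variable and the class encoding are assumptions made for this sketch. */
#include <stdint.h>

#define TRC_CLS_MASK   0xffff0000u      /* assumed: class in high bits */

static uint32_t tb_event_mask = 0xffffffffu;   /* trace everything */

static int event_enabled(uint32_t event)
{
    /* Drop the record early if its class is masked out. */
    return (tb_event_mask & event & TRC_CLS_MASK) != 0;
}

static void trace(uint32_t event /*, data... */)
{
    if ( !event_enabled(event) )
        return;                 /* nothing is written into the buffer */

    /* ... format the record and advance the producer index ... */
}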
On 6/19/06, Rob Gardner <rob.gardner@hp.com> wrote:
> I am glad to discover that I misunderstood you. ;) But I am still having
> trouble understanding what the actual problem is, or even if one exists.

Well, I ran some tests, and no problem exists, yet. Running the following:

# time xentrace -e 0x81000 /tmp/test22-passmark.trace
change evtmask to 0x81000

real    7m15.456s
user    0m0.080s
sys     0m0.050s

# ls -l /tmp/test22-passmark.trace
-rw-r--r-- 1 root root 2654091720 Jun 21 14:49 /tmp/test22-passmark.trace

So although 2.6 gigabytes of trace data was generated in 7 minutes, the
total time spent in user and system mode (if the numbers time reports
are accurate) was less than 0.13 seconds.

The only potential issues would be with cache trashing -- both the
buffer cache (from plain writes to the file) and the cpu caches (from
copying the data). If anyone finds a workload this is a problem for, we
can look at it then.

-George