Yoshihiro YUNOMAE
2012-Jul-24 02:36 UTC
[RFC PATCH 0/6] virtio-trace: Support virtio-trace
Hi All, The following patch set provides a low-overhead system for collecting kernel tracing data of guests by a host in a virtualization environment. A guest OS generally shares some devices with other guests or a host, so reasons of any problems occurring in a guest may be from other guests or a host. Then, to collect some tracing data of a number of guests and a host is needed when some problems occur in a virtualization environment. One of methods to realize that is to collect tracing data of guests in a host. To do this, network is generally used. However, high load will be taken to applications on guests using network I/O because there are many network stack layers. Therefore, a communication method for collecting the data without using network is needed. We submitted a patch set of "IVRing", a ring-buffer driver constructed on Inter-VM shared memory (IVShmem), to LKML http://lwn.net/Articles/500304/ in this June. IVRing and the IVRing reader use POSIX shared memory each other without using network, so a low-overhead system for collecting guest tracing data is realized. However, this patch set has some problems as follows: - use IVShmem instead of virtio - create a new ring-buffer without using existing ring-buffer in kernel - scalability -- not support SMP environment -- buffer size limitation -- not support live migration (maybe difficult for realize this) Therefore, we propose a new system "virtio-trace", which uses enhanced virtio-serial and existing ring-buffer of ftrace, for collecting guest kernel tracing data. In this system, there are 5 main components: (1) Ring-buffer of ftrace in a guest - When trace agent reads ring-buffer, a page is removed from ring-buffer. (2) Trace agent in the guest - Splice the page of ring-buffer to read_pipe using splice() without memory copying. Then, the page is spliced from write_pipe to virtio without memory copying. (3) Virtio-console driver in the guest - Pass the page to virtio-ring (4) Virtio-serial bus in QEMU - Copy the page to kernel pipe (5) Reader in the host - Read guest tracing data via FIFO(named pipe) ***Evaluation*** When a host collects tracing data of a guest, the performance of using virtio-trace is compared with that of using native(just running ftrace), IVRing, and virtio-serial(normal method of read/write). <environment> The overview of this evaluation is as follows: (a) A guest on a KVM is prepared. - The guest is dedicated one physical CPU as a virtual CPU(VCPU). (b) The guest starts to write tracing data to ring-buffer of ftrace. - The probe points are all trace points of sched, timer, and kmem. (c) Writing trace data, dhrystone 2 in UNIX bench is executed as a benchmark tool in the guest. - Dhrystone 2 intends system performance by repeating integer arithmetic as a score. - Since higher score equals to better system performance, if the score decrease based on bare environment, it indicates that any operation disturbs the integer arithmetic. Then, we define the overhead of transporting trace data is calculated as follows: OVERHEAD = (1 - SCORE_OF_A_METHOD/NATIVE_SCORE) * 100. The performance of each method is compared as follows: [1] Native - only recording trace data to ring-buffer on a guest [2] Virtio-trace - running a trace agent on a guest - a reader on a host opens FIFO using cat command [3] IVRing - A SystemTap script in a guest records trace data to IVRing. -- probe points are same as ftrace. [4] Virtio-serial(normal) - A reader(using cat) on a guest output trace data to a host using standard output via virtio-serial. Other information is as follows: - host kernel: 3.3.7-1 (Fedora16) CPU: Intel Xeon x5660 at 2.80GHz(12core) Memory: 48GB - guest(only booting one guest) kernel: 3.5.0-rc4+ (Fedora16) CPU: 1VCPU(dedicated) Memory: 1GB <result> 3 patterns based on the bare environment were indicated as follows: Scores overhead against [0] Native [0] Native: 28807569.5 - [1] Virtio-trace: 28685049.5 0.43% [2] IVRing: 28418595.5 1.35% [3] Virtio-serial: 13262258.7 53.96% ***Just enhancement ideas*** - Support for trace-cmd - Support for 9pfs protocol - Support for non-blocking mode in QEMU - Make "vhost-serial" Thank you, --- Masami Hiramatsu (5): virtio/console: Allocate scatterlist according to the current pipe size ftrace: Allow stealing pages from pipe buffer virtio/console: Wait until the port is ready on splice virtio/console: Add a failback for unstealable pipe buffer virtio/console: Add splice_write support Yoshihiro YUNOMAE (1): tools: Add guest trace agent as a user tool drivers/char/virtio_console.c | 198 ++++++++++++++++++-- kernel/trace/trace.c | 8 - tools/virtio/virtio-trace/Makefile | 14 + tools/virtio/virtio-trace/README | 118 ++++++++++++ tools/virtio/virtio-trace/trace-agent-ctl.c | 137 ++++++++++++++ tools/virtio/virtio-trace/trace-agent-rw.c | 192 +++++++++++++++++++ tools/virtio/virtio-trace/trace-agent.c | 270 +++++++++++++++++++++++++++ tools/virtio/virtio-trace/trace-agent.h | 75 ++++++++ 8 files changed, 985 insertions(+), 27 deletions(-) create mode 100644 tools/virtio/virtio-trace/Makefile create mode 100644 tools/virtio/virtio-trace/README create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c create mode 100644 tools/virtio/virtio-trace/trace-agent.c create mode 100644 tools/virtio/virtio-trace/trace-agent.h -- Yoshihiro YUNOMAE Software Platform Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: yoshihiro.yunomae.ez at hitachi.com
Yoshihiro YUNOMAE
2012-Jul-24 02:37 UTC
[RFC PATCH 1/6] virtio/console: Add splice_write support
From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Enable to use splice_write from pipe to virtio-console port. This steals pages from pipe and directly send it to host. Note that this may accelerate only the guest to host path. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Cc: Amit Shah <amit.shah at redhat.com> Cc: Arnd Bergmann <arnd at arndb.de> Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org> --- drivers/char/virtio_console.c | 136 +++++++++++++++++++++++++++++++++++++++-- 1 files changed, 128 insertions(+), 8 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index cdf2f54..fe31b2f 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -24,6 +24,8 @@ #include <linux/err.h> #include <linux/freezer.h> #include <linux/fs.h> +#include <linux/splice.h> +#include <linux/pagemap.h> #include <linux/init.h> #include <linux/list.h> #include <linux/poll.h> @@ -227,6 +229,7 @@ struct port { bool guest_connected; }; +#define MAX_SPLICE_PAGES 32 /* This is the very early arch-specified put chars function. */ static int (*early_put_chars)(u32, const char *, int); @@ -474,26 +477,52 @@ static ssize_t send_control_msg(struct port *port, unsigned int event, return 0; } +struct buffer_token { + union { + void *buf; + struct scatterlist *sg; + } u; + bool sgpages; +}; + +static void reclaim_sg_pages(struct scatterlist *sg) +{ + int i; + struct page *page; + + for (i = 0; i < MAX_SPLICE_PAGES; i++) { + page = sg_page(&sg[i]); + if (!page) + break; + put_page(page); + } + kfree(sg); +} + /* Callers must take the port->outvq_lock */ static void reclaim_consumed_buffers(struct port *port) { - void *buf; + struct buffer_token *tok; unsigned int len; if (!port->portdev) { /* Device has been unplugged. vqs are already gone. */ return; } - while ((buf = virtqueue_get_buf(port->out_vq, &len))) { - kfree(buf); + while ((tok = virtqueue_get_buf(port->out_vq, &len))) { + if (tok->sgpages) + reclaim_sg_pages(tok->u.sg); + else + kfree(tok->u.buf); + kfree(tok); port->outvq_full = false; } } -static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, - bool nonblock) +static ssize_t __send_to_port(struct port *port, struct scatterlist *sg, + int nents, size_t in_count, + struct buffer_token *tok, bool nonblock) { - struct scatterlist sg[1]; struct virtqueue *out_vq; ssize_t ret; unsigned long flags; @@ -505,8 +534,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, reclaim_consumed_buffers(port); - sg_init_one(sg, in_buf, in_count); - ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC); + ret = virtqueue_add_buf(out_vq, sg, nents, 0, tok, GFP_ATOMIC); /* Tell Host to go! */ virtqueue_kick(out_vq); @@ -544,6 +572,37 @@ done: return in_count; } +static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, + bool nonblock) +{ + struct scatterlist sg[1]; + struct buffer_token *tok; + + tok = kmalloc(sizeof(*tok), GFP_ATOMIC); + if (!tok) + return -ENOMEM; + tok->sgpages = false; + tok->u.buf = in_buf; + + sg_init_one(sg, in_buf, in_count); + + return __send_to_port(port, sg, 1, in_count, tok, nonblock); +} + +static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents, + size_t in_count, bool nonblock) +{ + struct buffer_token *tok; + + tok = kmalloc(sizeof(*tok), GFP_ATOMIC); + if (!tok) + return -ENOMEM; + tok->sgpages = true; + tok->u.sg = sg; + + return __send_to_port(port, sg, nents, in_count, tok, nonblock); +} + /* * Give out the data that's requested from the buffer that we have * queued up. @@ -725,6 +784,66 @@ out: return ret; } +struct sg_list { + unsigned int n; + size_t len; + struct scatterlist *sg; +}; + +static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, + struct splice_desc *sd) +{ + struct sg_list *sgl = sd->u.data; + unsigned int len = 0; + + if (sgl->n == MAX_SPLICE_PAGES) + return 0; + + /* Try lock this page */ + if (buf->ops->steal(pipe, buf) == 0) { + /* Get reference and unlock page for moving */ + get_page(buf->page); + unlock_page(buf->page); + + len = min(buf->len, sd->len); + sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset); + sgl->n++; + sgl->len += len; + } + + return len; +} + +/* Faster zero-copy write by splicing */ +static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe, + struct file *filp, loff_t *ppos, + size_t len, unsigned int flags) +{ + struct port *port = filp->private_data; + struct sg_list sgl; + ssize_t ret; + struct splice_desc sd = { + .total_len = len, + .flags = flags, + .pos = *ppos, + .u.data = &sgl, + }; + + sgl.n = 0; + sgl.len = 0; + sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES, + GFP_ATOMIC); + if (unlikely(!sgl.sg)) + return -ENOMEM; + + sg_init_table(sgl.sg, MAX_SPLICE_PAGES); + ret = __splice_from_pipe(pipe, &sd, pipe_to_sg); + if (likely(ret > 0)) + ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true); + + return ret; +} + static unsigned int port_fops_poll(struct file *filp, poll_table *wait) { struct port *port; @@ -856,6 +975,7 @@ static const struct file_operations port_fops = { .open = port_fops_open, .read = port_fops_read, .write = port_fops_write, + .splice_write = port_fops_splice_write, .poll = port_fops_poll, .release = port_fops_release, .fasync = port_fops_fasync,
Yoshihiro YUNOMAE
2012-Jul-24 02:37 UTC
[RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer
From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Add a failback memcpy path for unstealable pipe buffer. If buf->ops->steal() fails, virtio-serial tries to copy the page contents to an allocated page, instead of just failing splice(). Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Cc: Amit Shah <amit.shah at redhat.com> Cc: Arnd Bergmann <arnd at arndb.de> Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org> --- drivers/char/virtio_console.c | 28 +++++++++++++++++++++++++--- 1 files changed, 25 insertions(+), 3 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index fe31b2f..911cb3e 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -794,7 +794,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct splice_desc *sd) { struct sg_list *sgl = sd->u.data; - unsigned int len = 0; + unsigned int offset, len; if (sgl->n == MAX_SPLICE_PAGES) return 0; @@ -807,9 +807,31 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, len = min(buf->len, sd->len); sg_set_page(&(sgl->sg[sgl->n]), buf->page, len, buf->offset); - sgl->n++; - sgl->len += len; + } else { + /* Failback to copying a page */ + struct page *page = alloc_page(GFP_KERNEL); + char *src = buf->ops->map(pipe, buf, 1); + char *dst; + + if (!page) + return -ENOMEM; + dst = kmap(page); + + offset = sd->pos & ~PAGE_MASK; + + len = sd->len; + if (len + offset > PAGE_SIZE) + len = PAGE_SIZE - offset; + + memcpy(dst + offset, src + buf->offset, len); + + kunmap(page); + buf->ops->unmap(pipe, buf, src); + + sg_set_page(&(sgl->sg[sgl->n]), page, len, offset); } + sgl->n++; + sgl->len += len; return len; }
Yoshihiro YUNOMAE
2012-Jul-24 02:37 UTC
[RFC PATCH 3/6] virtio/console: Wait until the port is ready on splice
From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Wait if the port is not connected or full on splice like as write is doing. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Cc: Amit Shah <amit.shah at redhat.com> Cc: Arnd Bergmann <arnd at arndb.de> Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org> --- drivers/char/virtio_console.c | 39 +++++++++++++++++++++++++++------------ 1 files changed, 27 insertions(+), 12 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index 911cb3e..e49d435 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -724,6 +724,26 @@ static ssize_t port_fops_read(struct file *filp, char __user *ubuf, return fill_readbuf(port, ubuf, count, true); } +static int wait_port_writable(struct port *port, bool nonblock) +{ + int ret; + + if (will_write_block(port)) { + if (nonblock) + return -EAGAIN; + + ret = wait_event_freezable(port->waitqueue, + !will_write_block(port)); + if (ret < 0) + return ret; + } + /* Port got hot-unplugged. */ + if (!port->guest_connected) + return -ENODEV; + + return 0; +} + static ssize_t port_fops_write(struct file *filp, const char __user *ubuf, size_t count, loff_t *offp) { @@ -740,18 +760,9 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf, nonblock = filp->f_flags & O_NONBLOCK; - if (will_write_block(port)) { - if (nonblock) - return -EAGAIN; - - ret = wait_event_freezable(port->waitqueue, - !will_write_block(port)); - if (ret < 0) - return ret; - } - /* Port got hot-unplugged. */ - if (!port->guest_connected) - return -ENODEV; + ret = wait_port_writable(port, nonblock); + if (ret < 0) + return ret; count = min((size_t)(32 * 1024), count); @@ -851,6 +862,10 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe, .u.data = &sgl, }; + ret = wait_port_writable(port, filp->f_flags & O_NONBLOCK); + if (ret < 0) + return ret; + sgl.n = 0; sgl.len = 0; sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES,
Yoshihiro YUNOMAE
2012-Jul-24 02:37 UTC
[RFC PATCH 4/6] ftrace: Allow stealing pages from pipe buffer
From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Use generic steal operation on pipe buffer to allow stealing ring buffer's read page from pipe buffer. Note that this could reduce the performance of splice on the splice_write side operation without affinity setting. Since the ring buffer's read pages are allocated on the tracing-node, but the splice user does not always execute splice write side operation on the same node. In this case, the page will be accessed from the another node. Thus, it is strongly recommended to assign the splicing thread to corresponding node. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Cc: Steven Rostedt <rostedt at goodmis.org> Cc: Frederic Weisbecker <fweisbec at gmail.com> Cc: Ingo Molnar <mingo at redhat.com> --- kernel/trace/trace.c | 8 +------- 1 files changed, 1 insertions(+), 7 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index a120f98..ae01930 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe, buf->private = 0; } -static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe, - struct pipe_buffer *buf) -{ - return 1; -} - static void buffer_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { @@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations buffer_pipe_buf_ops = { .unmap = generic_pipe_buf_unmap, .confirm = generic_pipe_buf_confirm, .release = buffer_pipe_buf_release, - .steal = buffer_pipe_buf_steal, + .steal = generic_pipe_buf_steal, .get = buffer_pipe_buf_get, };
Yoshihiro YUNOMAE
2012-Jul-24 02:37 UTC
[RFC PATCH 5/6] virtio/console: Allocate scatterlist according to the current pipe size
From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Allocate scatterlist according to the current pipe size. This allows splicing bigger buffer if the pipe size has been changed by fcntl. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> Cc: Amit Shah <amit.shah at redhat.com> Cc: Arnd Bergmann <arnd at arndb.de> Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org> --- drivers/char/virtio_console.c | 23 ++++++++++++----------- 1 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index e49d435..f5063d5 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -229,7 +229,6 @@ struct port { bool guest_connected; }; -#define MAX_SPLICE_PAGES 32 /* This is the very early arch-specified put chars function. */ static int (*early_put_chars)(u32, const char *, int); @@ -482,15 +481,16 @@ struct buffer_token { void *buf; struct scatterlist *sg; } u; - bool sgpages; + /* If sgpages == 0 then buf is used, else sg is used */ + unsigned int sgpages; }; -static void reclaim_sg_pages(struct scatterlist *sg) +static void reclaim_sg_pages(struct scatterlist *sg, unsigned int nrpages) { int i; struct page *page; - for (i = 0; i < MAX_SPLICE_PAGES; i++) { + for (i = 0; i < nrpages; i++) { page = sg_page(&sg[i]); if (!page) break; @@ -511,7 +511,7 @@ static void reclaim_consumed_buffers(struct port *port) } while ((tok = virtqueue_get_buf(port->out_vq, &len))) { if (tok->sgpages) - reclaim_sg_pages(tok->u.sg); + reclaim_sg_pages(tok->u.sg, tok->sgpages); else kfree(tok->u.buf); kfree(tok); @@ -581,7 +581,7 @@ static ssize_t send_buf(struct port *port, void *in_buf, size_t in_count, tok = kmalloc(sizeof(*tok), GFP_ATOMIC); if (!tok) return -ENOMEM; - tok->sgpages = false; + tok->sgpages = 0; tok->u.buf = in_buf; sg_init_one(sg, in_buf, in_count); @@ -597,7 +597,7 @@ static ssize_t send_pages(struct port *port, struct scatterlist *sg, int nents, tok = kmalloc(sizeof(*tok), GFP_ATOMIC); if (!tok) return -ENOMEM; - tok->sgpages = true; + tok->sgpages = nents; tok->u.sg = sg; return __send_to_port(port, sg, nents, in_count, tok, nonblock); @@ -797,6 +797,7 @@ out: struct sg_list { unsigned int n; + unsigned int size; size_t len; struct scatterlist *sg; }; @@ -807,7 +808,7 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct sg_list *sgl = sd->u.data; unsigned int offset, len; - if (sgl->n == MAX_SPLICE_PAGES) + if (sgl->n == sgl->size) return 0; /* Try lock this page */ @@ -868,12 +869,12 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe, sgl.n = 0; sgl.len = 0; - sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES, - GFP_ATOMIC); + sgl.size = pipe->nrbufs; + sgl.sg = kmalloc(sizeof(struct scatterlist) * sgl.size, GFP_ATOMIC); if (unlikely(!sgl.sg)) return -ENOMEM; - sg_init_table(sgl.sg, MAX_SPLICE_PAGES); + sg_init_table(sgl.sg, sgl.size); ret = __splice_from_pipe(pipe, &sd, pipe_to_sg); if (likely(ret > 0)) ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true);
Yoshihiro YUNOMAE
2012-Jul-24 02:37 UTC
[RFC PATCH 6/6] tools: Add guest trace agent as a user tool
This patch adds a user tool, "trace agent" for sending trace data of a guest to a Host in low overhead. This agent has the following functions: - splice a page of ring-buffer to read_pipe without memory copying - splice the page from write_pipe to virtio-console without memory copying - write trace data to stdout by using -o option - controlled by start/stop orders from a Host Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez at hitachi.com> --- tools/virtio/virtio-trace/Makefile | 14 + tools/virtio/virtio-trace/README | 118 ++++++++++++ tools/virtio/virtio-trace/trace-agent-ctl.c | 137 ++++++++++++++ tools/virtio/virtio-trace/trace-agent-rw.c | 192 +++++++++++++++++++ tools/virtio/virtio-trace/trace-agent.c | 270 +++++++++++++++++++++++++++ tools/virtio/virtio-trace/trace-agent.h | 75 ++++++++ 6 files changed, 806 insertions(+), 0 deletions(-) create mode 100644 tools/virtio/virtio-trace/Makefile create mode 100644 tools/virtio/virtio-trace/README create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c create mode 100644 tools/virtio/virtio-trace/trace-agent.c create mode 100644 tools/virtio/virtio-trace/trace-agent.h diff --git a/tools/virtio/virtio-trace/Makefile b/tools/virtio/virtio-trace/Makefile new file mode 100644 index 0000000..ef3adfc --- /dev/null +++ b/tools/virtio/virtio-trace/Makefile @@ -0,0 +1,14 @@ +CC = gcc +CFLAGS = -O2 -Wall +LFLAG = -lpthread + +all: trace-agent + +.c.o: + $(CC) $(CFLAGS) $(LFLAG) -c $^ -o $@ + +trace-agent: trace-agent.o trace-agent-ctl.o trace-agent-rw.o + $(CC) $(CFLAGS) $(LFLAG) -o $@ $^ + +clean: + rm -f *.o trace-agent diff --git a/tools/virtio/virtio-trace/README b/tools/virtio/virtio-trace/README new file mode 100644 index 0000000..b64845b --- /dev/null +++ b/tools/virtio/virtio-trace/README @@ -0,0 +1,118 @@ +Trace Agent for virtio-trace +===========================+ +Trace agent is a user tool for sending trace data of a guest to a Host in low +overhead. Trace agent has the following functions: + - splice a page of ring-buffer to read_pipe without memory copying + - splice the page from write_pipe to virtio-console without memory copying + - write trace data to stdout by using -o option + - controlled by start/stop orders from a Host + +The trace agent operates as follows: + 1) Initialize all structures. + 2) Create a read/write thread per CPU. Each thread is bound to a CPU. + The read/write threads hold it. + 3) A controller thread does poll() for a start order of a host. + 4) After the controller of the trace agent receives a start order from a host, + the controller wake read/write threads. + 5) The read/write threads start to read trace data from ring-buffers and + write the data to virtio-serial. + 6) If the controller receives a stop order from a host, the read/write threads + stop to read trace data. + + +Files +====+ +README: this file +Makefile: Makefile of trace agent for virtio-trace +trace-agent.c: includes main function, sets up for operating trace agent +trace-agent.h: includes all structures and some macros +trace-agent-ctl.c: includes controller function for read/write threads +trace-agent-rw.c: includes read/write threads function + + +Setup +====+ +To use this trace agent for virtio-trace, we need to prepare some virtio-serial +I/Fs. + +1) Make FIFO in a host + virtio-trace uses virtio-serial pipe as trace data paths as to the number +of CPUs and a control path, so FIFO (named pipe) should be created as follows: + # mkdir /tmp/virtio-trace/ + # mkfifo /tmp/virtio-trace/trace-path-cpu{0,1,2,...,X}.{in,out} + # mkfifo /tmp/virtio-trace/agent-ctl-path.{in,out} + +For example, if a guest use three CPUs, the names are + trace-path-cpu{0,1,2}.{in.out} +and + agent-ctl-path.{in,out}. + +2) Set up of virtio-serial pipe in a host + Add qemu option to use virtio-serial pipe. + + ##virtio-serial device## + -device virtio-serial-pci,id=virtio-serial0\ + ##control path## + -chardev pipe,id=charchannel0,path=/tmp/virtio-trace/agent-ctl-path\ + -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,\ + id=channel0,name=agent-ctl-path\ + ##data path## + -chardev pipe,id=charchannel1,path=/tmp/virtio-trace/trace-path-cpu0\ + -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel0,\ + id=channel1,name=trace-path-cpu0\ + ... + +If you manage guests with libvirt, add the following tags to domain XML files. +Then, libvirt passes the same command option to qemu. + + <channel type='pipe'> + <source path='/tmp/virtio-trace/agent-ctl-path'/> + <target type='virtio' name='agent-ctl-path'/> + <address type='virtio-serial' controller='0' bus='0' port='0'/> + </channel> + <channel type='pipe'> + <source path='/tmp/virtio-trace/trace-path-cpu0'/> + <target type='virtio' name='trace-path-cpu0'/> + <address type='virtio-serial' controller='0' bus='0' port='1'/> + </channel> + ... +Here, chardev names are restricted to trace-path-cpuX and agent-ctl-path. For +example, if a guest use three CPUs, chardev names should be trace-path-cpu0, +trace-path-cpu1, trace-path-cpu2, and agent-ctl-path. + +3) Boot the guest + You can find some chardev in /dev/virtio-ports/ in the guest. + + +Run +==+ +0) Build trace agent in a guest + $ make + +1) Enable ftrace in the guest + <Example> + # echo 1 > /sys/kernel/debug/tracing/events/sched/enable + +2) Run trace agent in the guest + This agent must be operated as root. + # ./trace-agent +read/write threads in the agent wait for start order from host. If you add -o +option, trace data are output via stdout in the guest. + +3) Open FIFO in a host + # cat /tmp/virtio-trace/trace-path-cpu0.out +If a host does not open these, trace data get stuck in buffers of virtio. Then, +the guest will stop by specification of chardev in QEMU. This blocking mode may +be solved in the future. + +4) Start to read trace data by ordering from a host + A host injects read start order to the guest via virtio-serial. + # echo 1 > /tmp/virtio-trace/agent-ctl-path.in + +5) Stop to read trace data by ordering from a host + A host injects read stop order to the guest via virtio-serial. + # echo 0 > /tmp/virtio-trace/agent-ctl-path.in diff --git a/tools/virtio/virtio-trace/trace-agent-ctl.c b/tools/virtio/virtio-trace/trace-agent-ctl.c new file mode 100644 index 0000000..0739815 --- /dev/null +++ b/tools/virtio/virtio-trace/trace-agent-ctl.c @@ -0,0 +1,137 @@ +/* + * Controller of read/write threads for virtio-trace + * + * Copyright (C) 2012 Hitachi, Ltd. + * Created by Yoshihiro Yunomae <yoshihiro.yunomae.ez at hitachi.com> + * Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> + * + * Licensed under GPL version 2 only. + * + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <poll.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include "trace-agent.h" + +#define HOST_MSG_SIZE 256 +#define EVENT_WAIT_MSEC 100 + +static volatile sig_atomic_t global_signal_val; +bool global_sig_receive; /* default false */ +bool global_run_operation; /* default false*/ + +/* Handle SIGTERM/SIGINT/SIGQUIT to exit */ +static void signal_handler(int sig) +{ + global_signal_val = sig; +} + +int rw_ctl_init(const char *ctl_path) +{ + int ctl_fd; + + ctl_fd = open(ctl_path, O_RDONLY); + if (ctl_fd == -1) { + fprintf(stderr, "Cannot open ctl_fd\n"); + goto error; + } + + return ctl_fd; + +error: + exit(EXIT_FAILURE); +} + +static int wait_order(int ctl_fd) +{ + struct pollfd poll_fd; + int ret = 0; + + while (!global_sig_receive) { + poll_fd.fd = ctl_fd; + poll_fd.events = POLLIN; + + ret = poll(&poll_fd, 1, EVENT_WAIT_MSEC); + + if (global_signal_val) { + global_sig_receive = true; + pr_info("Receive interrupt %d\n", global_signal_val); + + /* Wakes rw-threads when they are sleeping */ + if (!global_run_operation) + pthread_cond_broadcast(&cond_wakeup); + + ret = -1; + break; + } + + if (ret < 0) { + pr_err("Polling error\n"); + goto error; + } + + if (ret) + break; + }; + + return ret; + +error: + exit(EXIT_FAILURE); +} + +/* + * contol read/write threads by handling global_run_operation + */ +void *rw_ctl_loop(int ctl_fd) +{ + ssize_t rlen; + char buf[HOST_MSG_SIZE]; + int ret; + + /* Setup signal handlers */ + signal(SIGTERM, signal_handler); + signal(SIGINT, signal_handler); + signal(SIGQUIT, signal_handler); + + while (!global_sig_receive) { + + ret = wait_order(ctl_fd); + if (ret < 0) + break; + + rlen = read(ctl_fd, buf, sizeof(buf)); + if (rlen < 0) { + pr_err("read data error in ctl thread\n"); + goto error; + } + + if (rlen == 2 && buf[0] == '1') { + /* + * If host writes '1' to a control path, + * this controller wakes all read/write threads. + */ + global_run_operation = true; + pthread_cond_broadcast(&cond_wakeup); + pr_debug("Wake up all read/write threads\n"); + } else if (rlen == 2 && buf[0] == '0') { + /* + * If host writes '0' to a control path, read/write + * threads will wait for notification from Host. + */ + global_run_operation = false; + pr_debug("Stop all read/write threads\n"); + } else + pr_info("Invalid host notification: %s\n", buf); + } + + return NULL; + +error: + exit(EXIT_FAILURE); +} diff --git a/tools/virtio/virtio-trace/trace-agent-rw.c b/tools/virtio/virtio-trace/trace-agent-rw.c new file mode 100644 index 0000000..3aace5e --- /dev/null +++ b/tools/virtio/virtio-trace/trace-agent-rw.c @@ -0,0 +1,192 @@ +/* + * Read/write thread of a guest agent for virtio-trace + * + * Copyright (C) 2012 Hitachi, Ltd. + * Created by Yoshihiro Yunomae <yoshihiro.yunomae.ez at hitachi.com> + * Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> + * + * Licensed under GPL version 2 only. + * + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sys/syscall.h> +#include "trace-agent.h" + +#define READ_WAIT_USEC 100000 + +void *rw_thread_info_new(void) +{ + struct rw_thread_info *rw_ti; + + rw_ti = zalloc(sizeof(struct rw_thread_info)); + if (rw_ti == NULL) { + pr_err("rw_thread_info zalloc error\n"); + exit(EXIT_FAILURE); + } + + rw_ti->cpu_num = -1; + rw_ti->in_fd = -1; + rw_ti->out_fd = -1; + rw_ti->read_pipe = -1; + rw_ti->write_pipe = -1; + rw_ti->pipe_size = PIPE_INIT; + + return rw_ti; +} + +void *rw_thread_init(int cpu, const char *in_path, const char *out_path, + bool stdout_flag, unsigned long pipe_size, + struct rw_thread_info *rw_ti) +{ + int data_pipe[2]; + + rw_ti->cpu_num = cpu; + + /* set read(input) fd */ + rw_ti->in_fd = open(in_path, O_RDONLY); + if (rw_ti->in_fd == -1) { + pr_err("Could not open in_fd (CPU:%d)\n", cpu); + goto error; + } + + /* set write(output) fd */ + if (!stdout_flag) { + /* virtio-serial output mode */ + rw_ti->out_fd = open(out_path, O_WRONLY); + if (rw_ti->out_fd == -1) { + pr_err("Could not open out_fd (CPU:%d)\n", cpu); + goto error; + } + } else + /* stdout mode */ + rw_ti->out_fd = STDOUT_FILENO; + + if (pipe2(data_pipe, O_NONBLOCK) < 0) { + pr_err("Could not create pipe in rw-thread(%d)\n", cpu); + goto error; + } + + /* + * Size of pipe is 64kB in default based on fs/pipe.c. + * To read/write trace data speedy, pipe size is changed. + */ + if (fcntl(*data_pipe, F_SETPIPE_SZ, pipe_size) < 0) { + pr_err("Could not change pipe size in rw-thread(%d)\n", cpu); + goto error; + } + + rw_ti->read_pipe = data_pipe[1]; + rw_ti->write_pipe = data_pipe[0]; + rw_ti->pipe_size = pipe_size; + + return NULL; + +error: + exit(EXIT_FAILURE); +} + +/* Bind a thread to a cpu */ +static void bind_cpu(int cpu_num) +{ + cpu_set_t mask; + + CPU_ZERO(&mask); + CPU_SET(cpu_num, &mask); + + /* bind my thread to cpu_num by assigning zero to the first argument */ + if (sched_setaffinity(0, sizeof(mask), &mask) == -1) + pr_err("Could not set CPU#%d affinity\n", (int)cpu_num); +} + +static void *rw_thread_main(void *thread_info) +{ + ssize_t rlen, wlen; + ssize_t ret; + struct rw_thread_info *ts = (struct rw_thread_info *)thread_info; + + bind_cpu(ts->cpu_num); + + while (1) { + /* Wait for a read order of trace data by Host OS */ + if (!global_run_operation) { + pthread_mutex_lock(&mutex_notify); + pthread_cond_wait(&cond_wakeup, &mutex_notify); + pthread_mutex_unlock(&mutex_notify); + } + + if (global_sig_receive) + break; + + /* + * Each thread read trace_pipe_raw of each cpu bounding the + * thread, so contention of multi-threads does not occur. + */ + rlen = splice(ts->in_fd, NULL, ts->read_pipe, NULL, + ts->pipe_size, SPLICE_F_MOVE | SPLICE_F_MORE); + + if (rlen < 0) { + pr_err("Splice_read in rw-thread(%d)\n", ts->cpu_num); + goto error; + } else if (rlen == 0) { + /* + * If trace data do not exist or are unreadable not + * for exceeding the page size, splice_read returns + * NULL. Then, this waits for being filled the data in a + * ring-buffer. + */ + usleep(READ_WAIT_USEC); + pr_debug("Read retry(cpu:%d)\n", ts->cpu_num); + continue; + } + + wlen = 0; + + do { + ret = splice(ts->write_pipe, NULL, ts->out_fd, NULL, + rlen - wlen, + SPLICE_F_MOVE | SPLICE_F_MORE); + + if (ret < 0) { + pr_err("Splice_write in rw-thread(%d)\n", + ts->cpu_num); + goto error; + } else if (ret == 0) + /* + * When host reader is not in time for reading + * trace data, guest will be stopped. This is + * because char dev in QEMU is not supported + * non-blocking mode. Then, writer might be + * sleep in that case. + * This sleep will be removed by supporting + * non-blocking mode. + */ + sleep(1); + wlen += ret; + } while (wlen < rlen); + } + + return NULL; + +error: + exit(EXIT_FAILURE); +} + + +pthread_t rw_thread_run(struct rw_thread_info *rw_ti) +{ + int ret; + pthread_t rw_thread_per_cpu; + + ret = pthread_create(&rw_thread_per_cpu, NULL, rw_thread_main, rw_ti); + if (ret != 0) { + pr_err("Could not create a rw thread(%d)\n", rw_ti->cpu_num); + exit(EXIT_FAILURE); + } + + return rw_thread_per_cpu; +} diff --git a/tools/virtio/virtio-trace/trace-agent.c b/tools/virtio/virtio-trace/trace-agent.c new file mode 100644 index 0000000..1816259 --- /dev/null +++ b/tools/virtio/virtio-trace/trace-agent.c @@ -0,0 +1,270 @@ +/* + * Guest agent for virtio-trace + * + * Copyright (C) 2012 Hitachi, Ltd. + * Created by Yoshihiro Yunomae <yoshihiro.yunomae.ez at hitachi.com> + * Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> + * + * Licensed under GPL version 2 only. + * + */ + +#define _GNU_SOURCE +#include <limits.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include "trace-agent.h" + +#define PAGE_SIZE (sysconf(_SC_PAGE_SIZE)) +#define PIPE_DEF_BUFS 16 +#define PIPE_MIN_SIZE (PAGE_SIZE*PIPE_DEF_BUFS) +#define PIPE_MAX_SIZE (1024*1024) +#define READ_PATH_FMT \ + "/sys/kernel/debug/tracing/per_cpu/cpu%d/trace_pipe_raw" +#define WRITE_PATH_FMT "/dev/virtio-ports/trace-path-cpu%d" +#define CTL_PATH "/dev/virtio-ports/agent-ctl-path" + +pthread_mutex_t mutex_notify = PTHREAD_MUTEX_INITIALIZER; +pthread_cond_t cond_wakeup = PTHREAD_COND_INITIALIZER; + +static int get_total_cpus(void) +{ + int nr_cpus = (int)sysconf(_SC_NPROCESSORS_CONF); + + if (nr_cpus <= 0) { + pr_err("Could not read cpus\n"); + goto error; + } else if (nr_cpus > MAX_CPUS) { + pr_err("Exceed max cpus(%d)\n", (int)MAX_CPUS); + goto error; + } + + return nr_cpus; + +error: + exit(EXIT_FAILURE); +} + +static void *agent_info_new(void) +{ + struct agent_info *s; + int i; + + s = zalloc(sizeof(struct agent_info)); + if (s == NULL) { + pr_err("agent_info zalloc error\n"); + exit(EXIT_FAILURE); + } + + s->pipe_size = PIPE_INIT; + s->use_stdout = false; + s->cpus = get_total_cpus(); + s->ctl_fd = -1; + + /* read/write threads init */ + for (i = 0; i < s->cpus; i++) + s->rw_ti[i] = rw_thread_info_new(); + + return s; +} + +static unsigned long parse_size(const char *arg) +{ + unsigned long value, round; + char *ptr; + + value = strtoul(arg, &ptr, 10); + switch (*ptr) { + case 'K': case 'k': + value <<= 10; + break; + case 'M': case 'm': + value <<= 20; + break; + default: + break; + } + + if (value > PIPE_MAX_SIZE) { + pr_err("Pipe size must be less than 1MB\n"); + goto error; + } else if (value < PIPE_MIN_SIZE) { + pr_err("Pipe size must be over 64KB\n"); + goto error; + } + + /* Align buffer size with page unit */ + round = value & (PAGE_SIZE - 1); + value = value - round; + + return value; +error: + return 0; +} + +static void usage(char const *prg) +{ + fprintf(stderr, "usage: %s [-h] [-o] [-s <size of pipe>]\n", prg); +} + +static const char *make_path(int cpu_num, bool this_is_write_path) +{ + int ret; + char *buf; + + buf = zalloc(PATH_MAX); + if (buf == NULL) { + pr_err("Could not allocate buffer\n"); + goto error; + } + + if (this_is_write_path) + /* write(output) path */ + ret = snprintf(buf, PATH_MAX, WRITE_PATH_FMT, cpu_num); + else + /* read(input) path */ + ret = snprintf(buf, PATH_MAX, READ_PATH_FMT, cpu_num); + + if (ret <= 0) { + pr_err("Failed to generate %s path(CPU#%d):%d\n", + this_is_write_path ? "read" : "write", cpu_num, ret); + goto error; + } + + return buf; + +error: + free(buf); + return NULL; +} + +static const char *make_input_path(int cpu_num) +{ + return make_path(cpu_num, false); +} + +static const char *make_output_path(int cpu_num) +{ + return make_path(cpu_num, true); +} + +static void *agent_info_init(struct agent_info *s) +{ + int cpu; + const char *in_path = NULL; + const char *out_path = NULL; + + /* init read/write threads */ + for (cpu = 0; cpu < s->cpus; cpu++) { + /* set read(input) path per read/write thread */ + in_path = make_input_path(cpu); + if (in_path == NULL) + goto error; + + /* set write(output) path per read/write thread*/ + if (!s->use_stdout) { + out_path = make_output_path(cpu); + if (out_path == NULL) + goto error; + } else + /* stdout mode */ + pr_debug("stdout mode\n"); + + rw_thread_init(cpu, in_path, out_path, s->use_stdout, + s->pipe_size, s->rw_ti[cpu]); + } + + /* init controller of read/write threads */ + s->ctl_fd = rw_ctl_init((const char *)CTL_PATH); + + return NULL; + +error: + exit(EXIT_FAILURE); +} + +static void *parse_args(int argc, char *argv[], struct agent_info *s) +{ + int cmd; + unsigned long size; + + while ((cmd = getopt(argc, argv, "hos:")) != -1) { + switch (cmd) { + /* stdout mode */ + case 'o': + s->use_stdout = true; + break; + /* size of pipe */ + case 's': + size = parse_size(optarg); + if (size == 0) + goto error; + s->pipe_size = size; + break; + case 'h': + default: + usage(argv[0]); + goto error; + } + } + + agent_info_init(s); + + return NULL; + +error: + exit(EXIT_FAILURE); +} + +static void agent_main_loop(struct agent_info *s) +{ + int cpu; + pthread_t rw_thread_per_cpu[MAX_CPUS]; + + /* Start all read/write threads */ + for (cpu = 0; cpu < s->cpus; cpu++) + rw_thread_per_cpu[cpu] = rw_thread_run(s->rw_ti[cpu]); + + rw_ctl_loop(s->ctl_fd); + + /* Finish all read/write threads */ + for (cpu = 0; cpu < s->cpus; cpu++) { + int ret; + + ret = pthread_join(rw_thread_per_cpu[cpu], NULL); + if (ret != 0) { + pr_err("pthread_join() error:%d (cpu %d)\n", ret, cpu); + exit(EXIT_FAILURE); + } + } +} + +static void agent_info_free(struct agent_info *s) +{ + int i; + + close(s->ctl_fd); + for (i = 0; i < s->cpus; i++) { + close(s->rw_ti[i]->in_fd); + close(s->rw_ti[i]->out_fd); + close(s->rw_ti[i]->read_pipe); + close(s->rw_ti[i]->write_pipe); + free(s->rw_ti[i]); + } + free(s); +} + +int main(int argc, char *argv[]) +{ + struct agent_info *s = NULL; + + s = agent_info_new(); + parse_args(argc, argv, s); + + agent_main_loop(s); + + agent_info_free(s); + + return 0; +} diff --git a/tools/virtio/virtio-trace/trace-agent.h b/tools/virtio/virtio-trace/trace-agent.h new file mode 100644 index 0000000..4711a75 --- /dev/null +++ b/tools/virtio/virtio-trace/trace-agent.h @@ -0,0 +1,75 @@ +#ifndef __GUEST_AGENT_H__ +#define __GUEST_AGENT_H__ +#include <pthread.h> +#include <stdbool.h> + +#define MAX_CPUS 256 +#define PIPE_INIT (1024*1024) + +/* + * agent_info - structure managing total information of guest agent + * @pipe_size: size of pipe (default 1MB) + * @use_stdout: set to true when o option is added (default false) + * @cpus: total number of CPUs + * @ctl_fd: fd of control path, /dev/virtio-ports/agent-ctl-path + * @rw_ti: structure managing information of read/write threads + */ +struct agent_info { + unsigned long pipe_size; + bool use_stdout; + int cpus; + int ctl_fd; + struct rw_thread_info *rw_ti[MAX_CPUS]; +}; + +/* + * rw_thread_info - structure managing a read/write thread a cpu + * @cpu_num: cpu number operating this read/write thread + * @in_fd: fd of reading trace data path in cpu_num + * @out_fd: fd of writing trace data path in cpu_num + * @read_pipe: fd of read pipe + * @write_pipe: fd of write pipe + * @pipe_size: size of pipe (default 1MB) + */ +struct rw_thread_info { + int cpu_num; + int in_fd; + int out_fd; + int read_pipe; + int write_pipe; + unsigned long pipe_size; +}; + +/* use for stopping rw threads */ +extern bool global_sig_receive; + +/* use for notification */ +extern bool global_run_operation; +extern pthread_mutex_t mutex_notify; +extern pthread_cond_t cond_wakeup; + +/* for controller of read/write threads */ +extern int rw_ctl_init(const char *ctl_path); +extern void *rw_ctl_loop(int ctl_fd); + +/* for trace read/write thread */ +extern void *rw_thread_info_new(void); +extern void *rw_thread_init(int cpu, const char *in_path, const char *out_path, + bool stdout_flag, unsigned long pipe_size, + struct rw_thread_info *rw_ti); +extern pthread_t rw_thread_run(struct rw_thread_info *rw_ti); + +static inline void *zalloc(size_t size) +{ + return calloc(1, size); +} + +#define pr_err(format, ...) fprintf(stderr, format, ## __VA_ARGS__) +#define pr_info(format, ...) fprintf(stdout, format, ## __VA_ARGS__) +#ifdef DEBUG +#define pr_debug(format, ...) fprintf(stderr, format, ## __VA_ARGS__) +#else +#define pr_debug(format, ...) do {} while (0) +#endif + +#endif /*__GUEST_AGENT_H__*/
(2012/07/24 11:36), Yoshihiro YUNOMAE wrote:> Therefore, we propose a new system "virtio-trace", which uses enhanced > virtio-serial and existing ring-buffer of ftrace, for collecting guest kernel > tracing data. In this system, there are 5 main components: > (1) Ring-buffer of ftrace in a guest > - When trace agent reads ring-buffer, a page is removed from ring-buffer. > (2) Trace agent in the guest > - Splice the page of ring-buffer to read_pipe using splice() without > memory copying. Then, the page is spliced from write_pipe to virtio > without memory copying. > (3) Virtio-console driver in the guest > - Pass the page to virtio-ring > (4) Virtio-serial bus in QEMU > - Copy the page to kernel pipe > (5) Reader in the host > - Read guest tracing data via FIFO(named pipe)So, this is our answer for the argued points in previous thread. This virtio-serial and ftrace enhancements doesn't introduce new "ringbuffer" in the kernel, and just use virtio's ringbuffer. Also, using splice gives us a great advantage in the performance because of copy-less trace-data transfer. Actually, one copy should occur in the host (to write it into the pipe), because removing physical pages of the guest is hard to track and may involve a TLB flush per page, even if it is done in background. Thank you, -- Masami HIRAMATSU Software Platform Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu.pt at hitachi.com
On Tue, Jul 24, 2012 at 3:36 AM, Yoshihiro YUNOMAE <yoshihiro.yunomae.ez at hitachi.com> wrote:> The performance of each method is compared as follows: > [1] Native > - only recording trace data to ring-buffer on a guest > [2] Virtio-trace > - running a trace agent on a guest > - a reader on a host opens FIFO using cat command > [3] IVRing > - A SystemTap script in a guest records trace data to IVRing. > -- probe points are same as ftrace. > [4] Virtio-serial(normal) > - A reader(using cat) on a guest output trace data to a host using > standard output via virtio-serial.The first time I read this I thought you are adding a new virtio-trace device. But it looks like this series really add splice support to virtio-console and that yields a big performance improvement when sending trace_pipe_raw. Guest ftrace is useful and I like this. Have you thought about controlling ftrace from the host? Perhaps a command could be added to the QEMU guest agent which basically invokes trace-cmd/perf. Are you using text formatted ftrace? Stefan
Blue Swirl
2012-Jul-24 20:26 UTC
[Qemu-devel] [RFC PATCH 0/6] virtio-trace: Support virtio-trace
On Tue, Jul 24, 2012 at 2:36 AM, Yoshihiro YUNOMAE <yoshihiro.yunomae.ez at hitachi.com> wrote:> Hi All, > > The following patch set provides a low-overhead system for collecting kernel > tracing data of guests by a host in a virtualization environment. > > A guest OS generally shares some devices with other guests or a host, so > reasons of any problems occurring in a guest may be from other guests or a host. > Then, to collect some tracing data of a number of guests and a host is needed > when some problems occur in a virtualization environment. One of methods to > realize that is to collect tracing data of guests in a host. To do this, network > is generally used. However, high load will be taken to applications on guests > using network I/O because there are many network stack layers. Therefore, > a communication method for collecting the data without using network is needed.I implemented something similar earlier by passing trace data from OpenBIOS to QEMU using the firmware configuration device. The data format was the same as QEMU used for simpletrace event structure instead of ftrace. I didn't commit it because of a few problems. I'm not familiar with ftrace, is it possible to trace two guest applications (BIOS and kernel) at the same time? Or could this be handled by opening two different virtio-serial pipes, one for BIOS and the other for the kernel? In my version, the tracepoint ID would have been used to demultiplex QEMU tracepoints from BIOS tracepoints, but something like separate ID spaces would have been better.> > We submitted a patch set of "IVRing", a ring-buffer driver constructed on > Inter-VM shared memory (IVShmem), to LKML http://lwn.net/Articles/500304/ in > this June. IVRing and the IVRing reader use POSIX shared memory each other > without using network, so a low-overhead system for collecting guest tracing > data is realized. However, this patch set has some problems as follows: > - use IVShmem instead of virtio > - create a new ring-buffer without using existing ring-buffer in kernel > - scalability > -- not support SMP environment > -- buffer size limitation > -- not support live migration (maybe difficult for realize this) > > Therefore, we propose a new system "virtio-trace", which uses enhanced > virtio-serial and existing ring-buffer of ftrace, for collecting guest kernel > tracing data. In this system, there are 5 main components: > (1) Ring-buffer of ftrace in a guest > - When trace agent reads ring-buffer, a page is removed from ring-buffer. > (2) Trace agent in the guest > - Splice the page of ring-buffer to read_pipe using splice() without > memory copying. Then, the page is spliced from write_pipe to virtio > without memory copying. > (3) Virtio-console driver in the guest > - Pass the page to virtio-ring > (4) Virtio-serial bus in QEMU > - Copy the page to kernel pipe > (5) Reader in the host > - Read guest tracing data via FIFO(named pipe) > > ***Evaluation*** > When a host collects tracing data of a guest, the performance of using > virtio-trace is compared with that of using native(just running ftrace), > IVRing, and virtio-serial(normal method of read/write). > > <environment> > The overview of this evaluation is as follows: > (a) A guest on a KVM is prepared. > - The guest is dedicated one physical CPU as a virtual CPU(VCPU). > > (b) The guest starts to write tracing data to ring-buffer of ftrace. > - The probe points are all trace points of sched, timer, and kmem. > > (c) Writing trace data, dhrystone 2 in UNIX bench is executed as a benchmark > tool in the guest. > - Dhrystone 2 intends system performance by repeating integer arithmetic > as a score. > - Since higher score equals to better system performance, if the score > decrease based on bare environment, it indicates that any operation > disturbs the integer arithmetic. Then, we define the overhead of > transporting trace data is calculated as follows: > OVERHEAD = (1 - SCORE_OF_A_METHOD/NATIVE_SCORE) * 100. > > The performance of each method is compared as follows: > [1] Native > - only recording trace data to ring-buffer on a guest > [2] Virtio-trace > - running a trace agent on a guest > - a reader on a host opens FIFO using cat command > [3] IVRing > - A SystemTap script in a guest records trace data to IVRing. > -- probe points are same as ftrace. > [4] Virtio-serial(normal) > - A reader(using cat) on a guest output trace data to a host using > standard output via virtio-serial. > > Other information is as follows: > - host > kernel: 3.3.7-1 (Fedora16) > CPU: Intel Xeon x5660 at 2.80GHz(12core) > Memory: 48GB > > - guest(only booting one guest) > kernel: 3.5.0-rc4+ (Fedora16) > CPU: 1VCPU(dedicated) > Memory: 1GB > > <result> > 3 patterns based on the bare environment were indicated as follows: > Scores overhead against [0] Native > [0] Native: 28807569.5 - > [1] Virtio-trace: 28685049.5 0.43% > [2] IVRing: 28418595.5 1.35% > [3] Virtio-serial: 13262258.7 53.96% > > > ***Just enhancement ideas*** > - Support for trace-cmd > - Support for 9pfs protocol > - Support for non-blocking mode in QEMU > - Make "vhost-serial" > > Thank you, > > --- > > Masami Hiramatsu (5): > virtio/console: Allocate scatterlist according to the current pipe size > ftrace: Allow stealing pages from pipe buffer > virtio/console: Wait until the port is ready on splice > virtio/console: Add a failback for unstealable pipe buffer > virtio/console: Add splice_write support > > Yoshihiro YUNOMAE (1): > tools: Add guest trace agent as a user tool > > > drivers/char/virtio_console.c | 198 ++++++++++++++++++-- > kernel/trace/trace.c | 8 - > tools/virtio/virtio-trace/Makefile | 14 + > tools/virtio/virtio-trace/README | 118 ++++++++++++ > tools/virtio/virtio-trace/trace-agent-ctl.c | 137 ++++++++++++++ > tools/virtio/virtio-trace/trace-agent-rw.c | 192 +++++++++++++++++++ > tools/virtio/virtio-trace/trace-agent.c | 270 +++++++++++++++++++++++++++ > tools/virtio/virtio-trace/trace-agent.h | 75 ++++++++ > 8 files changed, 985 insertions(+), 27 deletions(-) > create mode 100644 tools/virtio/virtio-trace/Makefile > create mode 100644 tools/virtio/virtio-trace/README > create mode 100644 tools/virtio/virtio-trace/trace-agent-ctl.c > create mode 100644 tools/virtio/virtio-trace/trace-agent-rw.c > create mode 100644 tools/virtio/virtio-trace/trace-agent.c > create mode 100644 tools/virtio/virtio-trace/trace-agent.h > > -- > Yoshihiro YUNOMAE > Software Platform Research Dept. Linux Technology Center > Hitachi, Ltd., Yokohama Research Laboratory > E-mail: yoshihiro.yunomae.ez at hitachi.com > >
On (Tue) 24 Jul 2012 [11:36:57], Yoshihiro YUNOMAE wrote:> Hi All, > > The following patch set provides a low-overhead system for collecting kernel > tracing data of guests by a host in a virtualization environment. > > A guest OS generally shares some devices with other guests or a host, so > reasons of any problems occurring in a guest may be from other guests or a host. > Then, to collect some tracing data of a number of guests and a host is needed > when some problems occur in a virtualization environment. One of methods to > realize that is to collect tracing data of guests in a host. To do this, network > is generally used. However, high load will be taken to applications on guests > using network I/O because there are many network stack layers. Therefore, > a communication method for collecting the data without using network is needed. > > We submitted a patch set of "IVRing", a ring-buffer driver constructed on > Inter-VM shared memory (IVShmem), to LKML http://lwn.net/Articles/500304/ in > this June. IVRing and the IVRing reader use POSIX shared memory each other > without using network, so a low-overhead system for collecting guest tracing > data is realized. However, this patch set has some problems as follows: > - use IVShmem instead of virtio > - create a new ring-buffer without using existing ring-buffer in kernel > - scalability > -- not support SMP environment > -- buffer size limitation > -- not support live migration (maybe difficult for realize this) > > Therefore, we propose a new system "virtio-trace", which uses enhanced > virtio-serial and existing ring-buffer of ftrace, for collecting guest kernel > tracing data. In this system, there are 5 main components: > (1) Ring-buffer of ftrace in a guest > - When trace agent reads ring-buffer, a page is removed from ring-buffer. > (2) Trace agent in the guest > - Splice the page of ring-buffer to read_pipe using splice() without > memory copying. Then, the page is spliced from write_pipe to virtio > without memory copying.I really like the splicing idea.> (3) Virtio-console driver in the guest > - Pass the page to virtio-ring > (4) Virtio-serial bus in QEMU > - Copy the page to kernel pipe > (5) Reader in the host > - Read guest tracing data via FIFO(named pipe)So will this be useful only if guest and host run the same kernel? I'd like to see the host kernel not being used at all -- collect all relevant info from the guest and send it out to qemu, where it can be consumed directly by apps driving the tracing.> ***Evaluation*** > When a host collects tracing data of a guest, the performance of using > virtio-trace is compared with that of using native(just running ftrace), > IVRing, and virtio-serial(normal method of read/write).Why is tracing performance-sensitive? i.e. why try to optimise this at all?> <environment> > The overview of this evaluation is as follows: > (a) A guest on a KVM is prepared. > - The guest is dedicated one physical CPU as a virtual CPU(VCPU). > > (b) The guest starts to write tracing data to ring-buffer of ftrace. > - The probe points are all trace points of sched, timer, and kmem. > > (c) Writing trace data, dhrystone 2 in UNIX bench is executed as a benchmark > tool in the guest. > - Dhrystone 2 intends system performance by repeating integer arithmetic > as a score. > - Since higher score equals to better system performance, if the score > decrease based on bare environment, it indicates that any operation > disturbs the integer arithmetic. Then, we define the overhead of > transporting trace data is calculated as follows: > OVERHEAD = (1 - SCORE_OF_A_METHOD/NATIVE_SCORE) * 100. > > The performance of each method is compared as follows: > [1] Native > - only recording trace data to ring-buffer on a guest > [2] Virtio-trace > - running a trace agent on a guest > - a reader on a host opens FIFO using cat command > [3] IVRing > - A SystemTap script in a guest records trace data to IVRing. > -- probe points are same as ftrace. > [4] Virtio-serial(normal) > - A reader(using cat) on a guest output trace data to a host using > standard output via virtio-serial. > > Other information is as follows: > - host > kernel: 3.3.7-1 (Fedora16) > CPU: Intel Xeon x5660 at 2.80GHz(12core) > Memory: 48GB > > - guest(only booting one guest) > kernel: 3.5.0-rc4+ (Fedora16) > CPU: 1VCPU(dedicated) > Memory: 1GB > > <result> > 3 patterns based on the bare environment were indicated as follows: > Scores overhead against [0] Native > [0] Native: 28807569.5 - > [1] Virtio-trace: 28685049.5 0.43% > [2] IVRing: 28418595.5 1.35% > [3] Virtio-serial: 13262258.7 53.96% > > > ***Just enhancement ideas*** > - Support for trace-cmd > - Support for 9pfs protocol > - Support for non-blocking mode in QEMUThere were patches long back (by me) to make chardevs non-blocking but they didn't make it upstream. Fedora carries them, if you want to try out. Though we want to converge on a reasonable solution that's acceptable upstream as well. Just that no one's working on it currently. Any help here will be appreciated.> - Make "vhost-serial"I need to understand a) why it's perf-critical, and b) why should the host be involved at all, to comment on these. Thanks, Amit
On (Tue) 24 Jul 2012 [11:37:07], Yoshihiro YUNOMAE wrote:> From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> > > Enable to use splice_write from pipe to virtio-console port. > This steals pages from pipe and directly send it to host. > > Note that this may accelerate only the guest to host path. > > Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> > Cc: Amit Shah <amit.shah at redhat.com> > Cc: Arnd Bergmann <arnd at arndb.de> > Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org> > ---> +/* Faster zero-copy write by splicing */ > +static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe, > + struct file *filp, loff_t *ppos, > + size_t len, unsigned int flags) > +{ > + struct port *port = filp->private_data; > + struct sg_list sgl; > + ssize_t ret; > + struct splice_desc sd = { > + .total_len = len, > + .flags = flags, > + .pos = *ppos, > + .u.data = &sgl, > + }; > + > + sgl.n = 0; > + sgl.len = 0; > + sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES, > + GFP_ATOMIC);Do you expect this function to be called from interrupt context?> + if (unlikely(!sgl.sg)) > + return -ENOMEM; > + > + sg_init_table(sgl.sg, MAX_SPLICE_PAGES); > + ret = __splice_from_pipe(pipe, &sd, pipe_to_sg); > + if (likely(ret > 0)) > + ret = send_pages(port, sgl.sg, sgl.n, sgl.len, true); > + > + return ret; > +}Amit
Masami Hiramatsu
2012-Aug-09 09:12 UTC
[RFC PATCH 1/6] virtio/console: Add splice_write support
(2012/08/09 18:00), Amit Shah wrote:> On (Tue) 24 Jul 2012 [11:37:07], Yoshihiro YUNOMAE wrote: >> From: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> >> >> Enable to use splice_write from pipe to virtio-console port. >> This steals pages from pipe and directly send it to host. >> >> Note that this may accelerate only the guest to host path. >> >> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt at hitachi.com> >> Cc: Amit Shah <amit.shah at redhat.com> >> Cc: Arnd Bergmann <arnd at arndb.de> >> Cc: Greg Kroah-Hartman <gregkh at linuxfoundation.org> >> --- > >> +/* Faster zero-copy write by splicing */ >> +static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe, >> + struct file *filp, loff_t *ppos, >> + size_t len, unsigned int flags) >> +{ >> + struct port *port = filp->private_data; >> + struct sg_list sgl; >> + ssize_t ret; >> + struct splice_desc sd = { >> + .total_len = len, >> + .flags = flags, >> + .pos = *ppos, >> + .u.data = &sgl, >> + }; >> + >> + sgl.n = 0; >> + sgl.len = 0; >> + sgl.sg = kmalloc(sizeof(struct scatterlist) * MAX_SPLICE_PAGES, >> + GFP_ATOMIC); > > Do you expect this function to be called from interrupt context?No, not at all. Oops, that should be GFP_KERNEL... Thank you for pointing it out. -- Masami HIRAMATSU Software Platform Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu.pt at hitachi.com
Hi, On (Tue) 24 Jul 2012 [11:36:57], Yoshihiro YUNOMAE wrote:> Hi All, > > The following patch set provides a low-overhead system for collecting kernel > tracing data of guests by a host in a virtualization environment.So I just have one minor comment, please post a non-RFC version of the patch. Since you have an ACK from Steven for the ftrace patch, I guess Rusty can push this in via his virtio tree? I'll ack the virtio-console bits in the next series you send. Thanks, Amit
Apparently Analagous Threads
- [RFC PATCH 0/6] virtio-trace: Support virtio-trace
- [PATCH V2 0/6] virtio-trace: Support virtio-trace
- [PATCH V2 0/6] virtio-trace: Support virtio-trace
- [PATCH 0/5] trace-cmd: Add a recorder readable feature for virtio-trace
- [PATCH 0/5] trace-cmd: Add a recorder readable feature for virtio-trace