[RFC PATCH] LTTng relay buffer allocation, read, write - Kernel

Thread: [RFC PATCH] LTTng relay buffer allocation, read, write

  1. [RFC PATCH] LTTng relay buffer allocation, read, write

    As I told Martin, I was thinking about taking an axe and moving stuff around in
    relay. Which I just did.

    This patch reimplements relay with a linked list of pages. Provides read/write
    wrappers which should be used to read or write from the buffers. It's the core
    of a layered approach to the design requirements expressed by Martin and
    discussed earlier.

    It does not provide _any_ sort of locking on buffer data. Locking should be done
    by the caller. Given that we might think of very lightweight locking schemes, it
    makes sense to me that the underlying buffering infrastructure supports event
    records larger than 1 page.

    A cache saving 3 pointers is used to keep track of current page used for the
    buffer for write, read and current subbuffer header pointer lookup. The offset
    of each page within the buffer is saved in the page frame structure to permit
    cache lookup without extra locking.

    TODO : Currently, no splice file operations are implemented. Should come soon.
    The idea is to splice the buffers directly into files or to the network.

    This patch is released as early RFC. It builds, that's about it. Testing will
    come when I implement the splice ops.
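
    A minimal usage sketch (illustrative only, not part of the patch; the
    my_* names are invented, and offset accounting plus locking are
    deliberately left to the caller, per the design above):

    /* client hooks creating/removing the buffer files (not shown here) */
    static struct rchan_callbacks my_cb = {
        .create_buf_file = my_create_buf_file,
        .remove_buf_file = my_remove_buf_file,
    };

    static int my_client_demo(struct dentry *parent)
    {
        struct rchan *chan;
        u32 event = 42;
        size_t my_write_offset = 0; /* caller-managed, normally atomic */

        /* 8 sub-buffers of 4 kB each, one buffer per online cpu */
        chan = ltt_relay_open("trace", parent, 4096, 8, &my_cb, NULL);
        if (!chan)
            return -ENOMEM;

        /* copy one event into this cpu's buffer, possibly across pages */
        ltt_relay_write(chan->buf[get_cpu()], my_write_offset,
                &event, sizeof(event));
        put_cpu();

        ltt_relay_close(chan);
        return 0;
    }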

    Signed-off-by: Mathieu Desnoyers
    CC: Martin Bligh
    CC: Peter Zijlstra
    CC: prasad@linux.vnet.ibm.com
    CC: Linus Torvalds
    CC: Thomas Gleixner
    CC: Steven Rostedt
    CC: od@suse.com
    CC: "Frank Ch. Eigler"
    CC: Andrew Morton
    CC: hch@lst.de
    CC: David Wilder
    ---
    include/linux/ltt-relay.h | 291 +++++++++++++++++++++++++++++
    ltt/ltt-relay-alloc.c | 450 ++++++++++++++++++++++++++++++++++++++++++++++
    2 files changed, 741 insertions(+)

    Index: linux-2.6-lttng/ltt/ltt-relay-alloc.c
    ===================================================================
    --- /dev/null 1970-01-01 00:00:00.000000000 +0000
    +++ linux-2.6-lttng/ltt/ltt-relay-alloc.c 2008-09-27 09:18:47.000000000 -0400
    @@ -0,0 +1,450 @@
    +/*
    + * Public API and common code for kernel->userspace relay file support.
    + *
    + * Copyright (C) 2002-2005 - Tom Zanussi (zanussi@us.ibm.com), IBM Corp
    + * Copyright (C) 1999-2005 - Karim Yaghmour (karim@opersys.com)
    + * Copyright (C) 2008 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
    + *
    + * Moved to kernel/relay.c by Paul Mundt, 2006.
    + * November 2006 - CPU hotplug support by Mathieu Desnoyers
    + * (mathieu.desnoyers@polymtl.ca)
    + *
    + * This file is released under the GPL.
    + */
    +#include <linux/errno.h>
    +#include <linux/stddef.h>
    +#include <linux/slab.h>
    +#include <linux/module.h>
    +#include <linux/string.h>
    +#include <linux/ltt-relay.h>
    +#include <linux/vmalloc.h>
    +#include <linux/mm.h>
    +#include <linux/cpu.h>
    +#include <linux/splice.h>
    +
    +/* list of open channels, for cpu hotplug */
    +static DEFINE_MUTEX(relay_channels_mutex);
    +static LIST_HEAD(relay_channels);
    +
    +/**
    + * relay_alloc_buf - allocate a channel buffer
    + * @buf: the buffer struct
    + * @size: total size of the buffer (in/out; rounded up to a page multiple)
    + */
    +static int relay_alloc_buf(struct rchan_buf *buf, size_t *size)
    +{
    + unsigned int i, n_pages;
    + struct buf_page *buf_page, *n;
    +
    + *size = PAGE_ALIGN(*size);
    + n_pages = *size >> PAGE_SHIFT;
    +
    + /* buf is kzalloc'ed by the caller: the list head needs explicit init */
    + INIT_LIST_HEAD(&buf->pages);
    +
    + for (i = 0; i < n_pages; i++) {
    + buf_page = (struct buf_page *)alloc_page(GFP_KERNEL);
    + if (unlikely(!buf_page))
    + goto depopulate;
    + list_add(&buf_page->list, &buf->pages);
    + buf_page->offset = i << PAGE_SHIFT;
    + buf_page->buf = buf;
    + memset(page_address(&buf_page->page), 0, PAGE_SIZE);
    + if (i == 0) {
    + buf->wpage = buf_page;
    + buf->hpage = buf_page;
    + buf->rpage = buf_page;
    + }
    + }
    + buf->page_count = n_pages;
    + return 0;
    +
    +depopulate:
    + list_for_each_entry_safe(buf_page, n, &buf->pages, list)
    + __free_page(&buf_page->page);
    + return -ENOMEM;
    +}
    +
    +/**
    + * relay_create_buf - allocate and initialize a channel buffer
    + * @chan: the relay channel
    + *
    + * Returns channel buffer if successful, %NULL otherwise.
    + */
    +static struct rchan_buf *relay_create_buf(struct rchan *chan)
    +{
    + int ret;
    + struct rchan_buf *buf = kzalloc(sizeof(struct rchan_buf), GFP_KERNEL);
    + if (!buf)
    + return NULL;
    +
    + ret = relay_alloc_buf(buf, &chan->alloc_size);
    + if (ret)
    + goto free_buf;
    +
    + buf->chan = chan;
    + kref_get(&buf->chan->kref);
    + return buf;
    +
    +free_buf:
    + kfree(buf);
    + return NULL;
    +}
    +
    +/**
    + * relay_destroy_channel - free the channel struct
    + * @kref: target kernel reference that contains the relay channel
    + *
    + * Should only be called from kref_put().
    + */
    +static void relay_destroy_channel(struct kref *kref)
    +{
    + struct rchan *chan = container_of(kref, struct rchan, kref);
    + kfree(chan);
    +}
    +
    +/**
    + * relay_destroy_buf - destroy an rchan_buf struct and associated buffer
    + * @buf: the buffer struct
    + */
    +static void relay_destroy_buf(struct rchan_buf *buf)
    +{
    + struct rchan *chan = buf->chan;
    + struct buf_page *buf_page, *n;
    +
    + list_for_each_entry_safe(buf_page, n, &buf->pages, list)
    + __free_page(&buf_page->page);
    + chan->buf[buf->cpu] = NULL;
    + kfree(buf);
    + kref_put(&chan->kref, relay_destroy_channel);
    +}
    +
    +/**
    + * relay_remove_buf - remove a channel buffer
    + * @kref: target kernel reference that contains the relay buffer
    + *
    + * Removes the file from the filesystem, which also frees the
    + * rchan_buf struct and the channel buffer. Should only be called from
    + * kref_put().
    + */
    +static void relay_remove_buf(struct kref *kref)
    +{
    + struct rchan_buf *buf = container_of(kref, struct rchan_buf, kref);
    + buf->chan->cb->remove_buf_file(buf->dentry);
    + relay_destroy_buf(buf);
    +}
    +
    +/*
    + * High-level relay kernel API and associated functions.
    + */
    +
    +/*
    + * rchan_callback implementations defining default channel behavior. Used
    + * in place of corresponding NULL values in client callback struct.
    + */
    +
    +/*
    + * create_buf_file() default callback. Does nothing.
    + */
    +static struct dentry *create_buf_file_default_callback(const char *filename,
    + struct dentry *parent,
    + int mode,
    + struct rchan_buf *buf)
    +{
    + return NULL;
    +}
    +
    +/*
    + * remove_buf_file() default callback. Does nothing.
    + */
    +static int remove_buf_file_default_callback(struct dentry *dentry)
    +{
    + return -EINVAL;
    +}
    +
    +/* relay channel default callbacks */
    +static struct rchan_callbacks default_channel_callbacks = {
    + .create_buf_file = create_buf_file_default_callback,
    + .remove_buf_file = remove_buf_file_default_callback,
    +};
    +
    +/**
    + * wakeup_readers - wake up readers waiting on a channel
    + * @data: contains the channel buffer
    + *
    + * This is the timer function used to defer reader waking.
    + */
    +static void wakeup_readers(unsigned long data)
    +{
    + struct rchan_buf *buf = (struct rchan_buf *)data;
    + wake_up_interruptible(&buf->read_wait);
    +}
    +
    +/**
    + * __relay_reset - reset a channel buffer
    + * @buf: the channel buffer
    + * @init: 1 if this is a first-time initialization
    + *
    + * See relay_reset() for description of effect.
    + */
    +static void __relay_reset(struct rchan_buf *buf, unsigned int init)
    +{
    + if (init) {
    + init_waitqueue_head(&buf->read_wait);
    + kref_init(&buf->kref);
    + setup_timer(&buf->timer, wakeup_readers, (unsigned long)buf);
    + } else
    + del_timer_sync(&buf->timer);
    +
    + buf->finalized = 0;
    +}
    +
    +/*
    + * relay_open_buf - create a new relay channel buffer
    + *
    + * used by relay_open() and CPU hotplug.
    + */
    +static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
    +{
    + struct rchan_buf *buf = NULL;
    + struct dentry *dentry;
    + char *tmpname;
    +
    + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
    + if (!tmpname)
    + goto end;
    + snprintf(tmpname, NAME_MAX, "%s%d", chan->base_filename, cpu);
    +
    + buf = relay_create_buf(chan);
    + if (!buf)
    + goto free_name;
    +
    + buf->cpu = cpu;
    + __relay_reset(buf, 1);
    +
    + /* Create file in fs */
    + dentry = chan->cb->create_buf_file(tmpname, chan->parent, S_IRUSR,
    + buf);
    + if (!dentry)
    + goto free_buf;
    +
    + buf->dentry = dentry;
    +
    + goto free_name;
    +
    +free_buf:
    + relay_destroy_buf(buf);
    + buf = NULL;
    +free_name:
    + kfree(tmpname);
    +end:
    + return buf;
    +}
    +
    +/**
    + * relay_close_buf - close a channel buffer
    + * @buf: channel buffer
    + *
    + * Marks the buffer finalized and restores the default callbacks.
    + * The channel buffer and channel buffer data structure are then freed
    + * automatically when the last reference is given up.
    + */
    +static void relay_close_buf(struct rchan_buf *buf)
    +{
    + del_timer_sync(&buf->timer);
    + kref_put(&buf->kref, relay_remove_buf);
    +}
    +
    +static void setup_callbacks(struct rchan *chan,
    + struct rchan_callbacks *cb)
    +{
    + if (!cb) {
    + chan->cb = &default_channel_callbacks;
    + return;
    + }
    +
    + if (!cb->create_buf_file)
    + cb->create_buf_file = create_buf_file_default_callback;
    + if (!cb->remove_buf_file)
    + cb->remove_buf_file = remove_buf_file_default_callback;
    + chan->cb = cb;
    +}
    +
    +/**
    + * relay_hotcpu_callback - CPU hotplug callback
    + * @nb: notifier block
    + * @action: hotplug action to take
    + * @hcpu: CPU number
    + *
    + * Returns the success/failure of the operation. (%NOTIFY_OK, %NOTIFY_BAD)
    + */
    +static int __cpuinit relay_hotcpu_callback(struct notifier_block *nb,
    + unsigned long action,
    + void *hcpu)
    +{
    + unsigned int hotcpu = (unsigned long)hcpu;
    + struct rchan *chan;
    +
    + switch(action) {
    + case CPU_UP_PREPARE:
    + case CPU_UP_PREPARE_FROZEN:
    + mutex_lock(&relay_channels_mutex);
    + list_for_each_entry(chan, &relay_channels, list) {
    + if (chan->buf[hotcpu])
    + continue;
    + chan->buf[hotcpu] = relay_open_buf(chan, hotcpu);
    + if(!chan->buf[hotcpu]) {
    + printk(KERN_ERR
    + "relay_hotcpu_callback: cpu %d buffer "
    + "creation failed\n", hotcpu);
    + mutex_unlock(&relay_channels_mutex);
    + return NOTIFY_BAD;
    + }
    + }
    + mutex_unlock(&relay_channels_mutex);
    + break;
    + case CPU_DEAD:
    + case CPU_DEAD_FROZEN:
    + /* No need to flush the cpu : will be flushed upon
    + * final relay_flush() call. */
    + break;
    + }
    + return NOTIFY_OK;
    +}
    +
    +/**
    + * ltt_relay_open - create a new relay channel
    + * @base_filename: base name of files to create
    + * @parent: dentry of parent directory, %NULL for root directory
    + * @subbuf_size: size of sub-buffers
    + * @n_subbufs: number of sub-buffers
    + * @cb: client callback functions
    + * @private_data: user-defined data
    + *
    + * Returns channel pointer if successful, %NULL otherwise.
    + *
    + * Creates a channel buffer for each cpu using the sizes and
    + * attributes specified. The created channel buffer files
    + * will be named base_filename0...base_filenameN-1. File
    + * permissions will be %S_IRUSR.
    + */
    +struct rchan *ltt_relay_open(const char *base_filename,
    + struct dentry *parent,
    + size_t subbuf_size,
    + size_t n_subbufs,
    + struct rchan_callbacks *cb,
    + void *private_data)
    +{
    + unsigned int i;
    + struct rchan *chan;
    + if (!base_filename)
    + return NULL;
    +
    + if (!(subbuf_size && n_subbufs))
    + return NULL;
    +
    + chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
    + if (!chan)
    + return NULL;
    +
    + chan->version = LTT_RELAY_CHANNEL_VERSION;
    + chan->n_subbufs = n_subbufs;
    + chan->subbuf_size = subbuf_size;
    + chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs);
    + chan->parent = parent;
    + chan->private_data = private_data;
    + strlcpy(chan->base_filename, base_filename, NAME_MAX);
    + setup_callbacks(chan, cb);
    + kref_init(&chan->kref);
    +
    + mutex_lock(&relay_channels_mutex);
    + for_each_online_cpu(i) {
    + chan->buf[i] = relay_open_buf(chan, i);
    + if (!chan->buf[i])
    + goto free_bufs;
    + }
    + list_add(&chan->list, &relay_channels);
    + mutex_unlock(&relay_channels_mutex);
    +
    + return chan;
    +
    +free_bufs:
    + for_each_online_cpu(i) {
    + if (!chan->buf[i])
    + break;
    + relay_close_buf(chan->buf[i]);
    + }
    +
    + kref_put(&chan->kref, relay_destroy_channel);
    + mutex_unlock(&relay_channels_mutex);
    + return NULL;
    +}
    +EXPORT_SYMBOL_GPL(ltt_relay_open);
    +
    +/**
    + * ltt_relay_close - close the channel
    + * @chan: the channel
    + *
    + * Closes all channel buffers and frees the channel.
    + */
    +void ltt_relay_close(struct rchan *chan)
    +{
    + unsigned int i;
    +
    + if (!chan)
    + return;
    +
    + mutex_lock(&relay_channels_mutex);
    + for_each_possible_cpu(i)
    + if (chan->buf[i])
    + relay_close_buf(chan->buf[i]);
    +
    + list_del(&chan->list);
    + kref_put(&chan->kref, relay_destroy_channel);
    + mutex_unlock(&relay_channels_mutex);
    +}
    +EXPORT_SYMBOL_GPL(ltt_relay_close);
    +
    +/**
    + * relay_file_open - open file op for relay files
    + * @inode: the inode
    + * @filp: the file
    + *
    + * Increments the channel buffer refcount.
    + */
    +static int relay_file_open(struct inode *inode, struct file *filp)
    +{
    + struct rchan_buf *buf = inode->i_private;
    + kref_get(&buf->kref);
    + filp->private_data = buf;
    +
    + return nonseekable_open(inode, filp);
    +}
    +
    +/**
    + * relay_file_release - release file op for relay files
    + * @inode: the inode
    + * @filp: the file
    + *
    + * Decrements the channel refcount, as the filesystem is
    + * no longer using it.
    + */
    +static int relay_file_release(struct inode *inode, struct file *filp)
    +{
    + struct rchan_buf *buf = filp->private_data;
    + kref_put(&buf->kref, relay_remove_buf);
    +
    + return 0;
    +}
    +
    +const struct file_operations ltt_relay_file_operations = {
    + .open = relay_file_open,
    + .llseek = no_llseek,
    + .release = relay_file_release,
    +};
    +EXPORT_SYMBOL_GPL(ltt_relay_file_operations);
    +
    +static __init int relay_init(void)
    +{
    + hotcpu_notifier(relay_hotcpu_callback, 5);
    + return 0;
    +}
    +
    +module_init(relay_init);
    Index: linux-2.6-lttng/include/linux/ltt-relay.h
    ===================================================================
    --- /dev/null 1970-01-01 00:00:00.000000000 +0000
    +++ linux-2.6-lttng/include/linux/ltt-relay.h 2008-09-27 09:22:34.000000000 -0400
    @@ -0,0 +1,291 @@
    +/*
    + * linux/include/linux/ltt-relay.h
    + *
    + * Copyright (C) 2002, 2003 - Tom Zanussi (zanussi@us.ibm.com), IBM Corp
    + * Copyright (C) 1999, 2000, 2001, 2002 - Karim Yaghmour (karim@opersys.com)
    + * Copyright (C) 2008 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
    + *
    + * CONFIG_RELAY definitions and declarations
    + */
    +
    +#ifndef _LINUX_LTT_RELAY_H
    +#define _LINUX_LTT_RELAY_H
    +
    +#include <linux/types.h>
    +#include <linux/sched.h>
    +#include <linux/timer.h>
    +#include <linux/wait.h>
    +#include <linux/list.h>
    +#include <linux/fs.h>
    +#include <linux/poll.h>
    +#include <linux/kref.h>
    +#include <linux/mm.h>
    +
    +/* Needs a _much_ better name... */
    +#define FIX_SIZE(x) ((((x) - 1) & PAGE_MASK) + PAGE_SIZE)
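    +/*
    + * Rounds up to a whole number of pages, e.g. with 4 kB pages:
    + * FIX_SIZE(1) == 4096, FIX_SIZE(4096) == 4096, FIX_SIZE(4097) == 8192.
    + */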
    +
    +/*
    + * Tracks changes to rchan/rchan_buf structs
    + */
    +#define LTT_RELAY_CHANNEL_VERSION 8
    +
    +struct buf_page {
    + union {
    + struct {
    + unsigned long flags; /* mandatory */
    + atomic_t _count; /* mandatory */
    + struct list_head list; /* linked list of buf pages */
    + size_t offset; /* page offset in the buffer */
    + };
    + struct page page;
    + };
    +};
    +
    +/*
    + * Per-cpu relay channel buffer
    + */
    +struct rchan_buf
    +{
    + struct rchan *chan; /* associated channel */
    + wait_queue_head_t read_wait; /* reader wait queue */
    + struct timer_list timer; /* reader wake-up timer */
    + struct dentry *dentry; /* channel file dentry */
    + struct kref kref; /* channel buffer refcount */
    + struct list_head pages; /* list of buffer pages */
    + struct buf_page *wpage; /* current write page (cache) */
    + struct buf_page *hpage; /* current subbuf header page (cache) */
    + struct buf_page *rpage; /* current subbuf read page (cache) */
    + unsigned int page_count; /* number of current buffer pages */
    + unsigned int finalized; /* buffer has been finalized */
    + unsigned int cpu; /* this buf's cpu */
    +} ____cacheline_aligned;
    +
    +/*
    + * Relay channel data structure
    + */
    +struct rchan
    +{
    + u32 version; /* the version of this struct */
    + size_t subbuf_size; /* sub-buffer size */
    + size_t n_subbufs; /* number of sub-buffers per buffer */
    + size_t alloc_size; /* total buffer size allocated */
    + struct rchan_callbacks *cb; /* client callbacks */
    + struct kref kref; /* channel refcount */
    + void *private_data; /* for user-defined data */
    + struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */
    + struct list_head list; /* for channel list */
    + struct dentry *parent; /* parent dentry passed to open */
    + char base_filename[NAME_MAX]; /* saved base filename */
    +};
    +
    +/*
    + * Relay channel client callbacks
    + */
    +struct rchan_callbacks
    +{
    + /*
    + * subbuf_start - called on buffer-switch to a new sub-buffer
    + * @buf: the channel buffer containing the new sub-buffer
    + * @subbuf: the start of the new sub-buffer
    + * @prev_subbuf: the start of the previous sub-buffer
    + * @prev_padding: unused space at the end of previous sub-buffer
    + *
    + * The client should return 1 to continue logging, 0 to stop
    + * logging.
    + *
    + * NOTE: subbuf_start will also be invoked when the buffer is
    + * created, so that the first sub-buffer can be initialized
    + * if necessary. In this case, prev_subbuf will be NULL.
    + *
    + * NOTE: the client can reserve bytes at the beginning of the new
    + * sub-buffer by calling subbuf_start_reserve() in this callback.
    + */
    + int (*subbuf_start) (struct rchan_buf *buf,
    + void *subbuf,
    + void *prev_subbuf,
    + size_t prev_padding);
    +
    + /*
    + * create_buf_file - create file to represent a relay channel buffer
    + * @filename: the name of the file to create
    + * @parent: the parent of the file to create
    + * @mode: the mode of the file to create
    + * @buf: the channel buffer
    + *
    + * Called during relay_open(), once for each per-cpu buffer,
    + * to allow the client to create a file to be used to
    + * represent the corresponding channel buffer. If the file is
    + * created outside of relay, the parent must also exist in
    + * that filesystem.
    + *
    + * The callback should return the dentry of the file created
    + * to represent the relay buffer.
    + *
    + * See Documentation/filesystems/relayfs.txt for more info.
    + */
    + struct dentry *(*create_buf_file)(const char *filename,
    + struct dentry *parent,
    + int mode,
    + struct rchan_buf *buf);
    +
    + /*
    + * remove_buf_file - remove file representing a relay channel buffer
    + * @dentry: the dentry of the file to remove
    + *
    + * Called during relay_close(), once for each per-cpu buffer,
    + * to allow the client to remove a file used to represent a
    + * channel buffer.
    + *
    + * The callback should return 0 if successful, negative if not.
    + */
    + int (*remove_buf_file)(struct dentry *dentry);
    +};
    +
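    +/*
    + * Page lookup helpers. Note: @page itself is used as the list anchor,
    + * so the walk starts at its neighbours and never visits @page itself;
    + * callers pass a cached page already known not to contain @offset.
    + */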
    +static inline struct buf_page *ltt_relay_find_prev_page(struct buf_page *page,
    + ssize_t offset)
    +{
    + list_for_each_entry_reverse(page, &page->list, list)
    + if (offset >= page->offset &&
    + offset < page->offset + PAGE_SIZE)
    + return page;
    + return NULL;
    +}
    +
    +static inline struct buf_page *ltt_relay_find_next_page(struct buf_page *page,
    + ssize_t offset)
    +{
    + list_for_each_entry(page, &page->list, list)
    + if (offset >= page->offset &&
    + offset < page->offset + PAGE_SIZE)
    + return page;
    + return NULL;
    +}
    +
    +/*
    + * Find the page containing "offset". Cache it if it is after the currently
    + * cached page.
    + */
    +static inline
    +struct buf_page *ltt_relay_cache_page(struct buf_page **page_cache,
    + struct buf_page *page, size_t offset)
    +{
    + ssize_t diff_offset;
    +
    + /*
    + * Make sure this is the page we want to write into. The current
    + * page is changed concurrently by other writers. [wrh]page are
    + * used as a cache remembering the last page written
    + * to/read/looked up for header address. No synchronization;
    + * we may have to find the previous page if a nested write
    + * occurred. Finding the right page is done by comparing the
    + * dest_offset with the buf_page offsets.
    + */
    + diff_offset = offset - page->offset;
    + if (unlikely(diff_offset >= PAGE_SIZE)) {
    + page = ltt_relay_find_next_page(page, offset);
    + WARN_ON(!page);
    + *page_cache = page;
    + } else if (unlikely(diff_offset < 0)) {
    + page = ltt_relay_find_prev_page(page, offset);
    + WARN_ON(!page);
    + }
    + return page;
    +}
    +
    +static inline int ltt_relay_write(struct rchan_buf *buf, size_t offset,
    + const void *src, size_t len)
    +{
    + struct buf_page *page;
    + ssize_t pagecpy, orig_len;
    +
    + orig_len = len;
    + page = buf->wpage;
    + if (unlikely(!len))
    + return 0;
    + for (;;) {
    + page = ltt_relay_cache_page(&buf->wpage, page, offset);
    + pagecpy = min(len, PAGE_SIZE - (offset & ~PAGE_MASK));
    + memcpy(page_address(&page->page)
    + + (offset & ~PAGE_MASK), src, pagecpy);
    + len -= pagecpy;
    + if (likely(!len))
    + break;
    + src += pagecpy;
    + offset += pagecpy;
    + /*
    + * Underlying layer should never ask for writes across
    + * subbuffers.
    + */
    + WARN_ON(offset >= buf->chan->alloc_size);
    + }
    + return orig_len;
    +}
    +
    +static inline int ltt_relay_read(struct rchan_buf *buf, size_t offset,
    + void *dest, size_t len)
    +{
    + struct buf_page *page;
    + ssize_t pagecpy, orig_len;
    +
    + orig_len = len;
    + page = buf->rpage;
    + if (unlikely(!len))
    + return 0;
    + for (;;) {
    + page = ltt_relay_cache_page(&buf->rpage, page, offset);
    + pagecpy = min(len, PAGE_SIZE - (offset & ~PAGE_MASK));
    + memcpy(dest, page_address(&page->page) + (offset & ~PAGE_MASK),
    + pagecpy);
    + len -= pagecpy;
    + if (likely(!len))
    + break;
    + dest += pagecpy;
    + offset += pagecpy;
    + /*
    + * Underlying layer should never ask for reads across
    + * subbuffers.
    + */
    + WARN_ON(offset >= buf->chan->alloc_size);
    + }
    + return orig_len;
    +}
    +
    +/*
    + * Return the address where a given offset is located.
    + * Should be used to get the current subbuffer header pointer. Given we know
    + * it's never on a page boundary, it's safe to write directly to this address,
    + * as long as the write is never bigger than a page size.
    + */
    +static inline void *ltt_relay_offset_address(struct rchan_buf *buf,
    + size_t offset)
    +{
    + struct buf_page *page;
    +
    + page = buf->hpage;
    + page = ltt_relay_cache_page(&buf->hpage, page, offset);
    + return page_address(&page->page) + (offset & ~PAGE_MASK);
    +}
    +
    +/*
    + * CONFIG_LTT_RELAY kernel API, ltt/ltt-relay-alloc.c
    + */
    +
    +struct rchan *ltt_relay_open(const char *base_filename,
    + struct dentry *parent,
    + size_t subbuf_size,
    + size_t n_subbufs,
    + struct rchan_callbacks *cb,
    + void *private_data);
    +extern void ltt_relay_close(struct rchan *chan);
    +
    +/*
    + * exported ltt_relay file operations, ltt/ltt-relay-alloc.c
    + */
    +extern const struct file_operations ltt_relay_file_operations;
    +
    +#endif /* _LINUX_LTT_RELAY_H */
    +
    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

  2. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
    > As I told Martin, I was thinking about taking an axe and moving stuff around in
    > relay. Which I just did.
    >
    > This patch reimplements relay with a linked list of pages. Provides read/write
    > wrappers which should be used to read or write from the buffers. It's the core
    > of a layered approach to the design requirements expressed by Martin and
    > discussed earlier.
    >
    > It does not provide _any_ sort of locking on buffer data. Locking should be done
    > by the caller. Given that we might think of very lightweight locking schemes, it
    > makes sense to me that the underlying buffering infrastructure supports event
    > records larger than 1 page.
    >
    > A cache saving 3 pointers is used to keep track of current page used for the
    > buffer for write, read and current subbuffer header pointer lookup. The offset
    > of each page within the buffer is saved in the page frame structure to permit
    > cache lookup without extra locking.
    >
    > TODO : Currently, no splice file operations are implemented. Should come soon.
    > The idea is to splice the buffers directly into files or to the network.
    >
    > This patch is released as early RFC. It builds, that's about it. Testing will
    > come when I implement the splice ops.


    Why? What aspects of Steve's ring-buffer interface will hinder us
    optimizing the implementation to be as light-weight as you like?

    The thing is, I'd like to see it that light as well ;-)

    As for the page-spanning entries, I think we can do that with Steve's
    system just fine, it's just that Linus said it's a dumb idea and Steve
    dropped it, but there is nothing fundamental stopping us from recording
    a length that is > PAGE_SIZE and copying data into the pages one at a
    time.

    Nor do I see it impossible to implement splice on top of Steve's
    ring-buffer..

    So again, why?


  3. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    On Sat, 2008-09-27 at 19:10 +0200, Peter Zijlstra wrote:
    > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:


    > > It does not provide _any_ sort of locking on buffer data. Locking should be done
    > > by the caller. Given that we might think of very lightweight locking schemes,


    Which defeats the whole purpose of the exercise, we want to provide a
    single mechanism - including locking - that is usable to all. Otherwise
    everybody gets to do the hard part themselves, which will undoubtedly
    result in many broken/suboptimal locking schemes.


  4. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
    > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
    > > As I told Martin, I was thinking about taking an axe and moving stuff around in
    > > relay. Which I just did.
    > >
    > > This patch reimplements relay with a linked list of pages. Provides read/write
    > > wrappers which should be used to read or write from the buffers. It's the core
    > > of a layered approach to the design requirements expressed by Martin and
    > > discussed earlier.
    > >
    > > It does not provide _any_ sort of locking on buffer data. Locking should be done
    > > by the caller. Given that we might think of very lightweight locking schemes, it
    > > makes sense to me that the underlying buffering infrastructure supports event
    > > records larger than 1 page.
    > >
    > > A cache saving 3 pointers is used to keep track of current page used for the
    > > buffer for write, read and current subbuffer header pointer lookup. The offset
    > > of each page within the buffer is saved in the page frame structure to permit
    > > cache lookup without extra locking.
    > >
    > > TODO : Currently, no splice file operations are implemented. Should come soon.
    > > The idea is to splice the buffers directly into files or to the network.
    > >
    > > This patch is released as early RFC. It builds, that's about it. Testing will
    > > come when I implement the splice ops.

    >
    > Why? What aspects of Steve's ring-buffer interface will hinder us
    > optimizing the implementation to be as light-weight as you like?
    >
    > The thing is, I'd like to see it that light as well ;-)
    >


    Ok, I'll try to explain my point of view. The thing is : I want those
    binary buffers to be exported to userspace, and I fear that the approach
    taken by Steven (let's write "simple" C structure directly into the
    buffers) will in fact be much more _complex_ (due to subtle compiler
    dependencies) than doing our own event payload (event data) format.

    Also, things such as
    ring_buffer_lock: A way to lock the entire buffer.
    ring_buffer_unlock: unlock the buffer.
    will probably become a problem when trying to go for a more efficient
    locking scheme.

    ring_buffer_peek: Look at a next item in the cpu buffer.
    This kind of feature is useless for data extraction to userspace and
    poses serious synchronization concerns if you have other writers in the
    same buffer.

    Structure for event records :

    struct ring_buffer_event {
        u32 type:2, len:3, time_delta:27;
        u32 array[];
    };

    The only good thing about reserving 2 bits for event IDs in there is to
    put the most frequent events in those IDs, which is clearly not the
    case:
    RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
    header telling the length of data written into the subbuffer (what you
    guys call a "page", but what I still think might be worthy to be of
    variable size, especially with a light locking infrastructure and
    considering we might want to export this data to userspace).
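
    As a sketch of that argument (field names hypothetical), a sub-buffer
    header exported to userspace could carry the written length and the
    switch timestamp, making in-stream padding events redundant:

    struct ltt_subbuf_header {
        u64 cycle_count_begin; /* timestamp taken at sub-buffer switch */
        u32 data_size;         /* bytes of event data actually written */
        u32 sb_size;           /* total sub-buffer size (may be variable) */
    };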

    RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.

    Also, if size _really_ matters, we should just export the event ID and
    look up the event payload size through a separate table. If the payload
    consists of a binary blob, then the data structure should start with a
    payload size and then have the actual binary blob.

    struct ring_buffer_event {
        u32 time_ext:1, evid:4, time_lsb:27;
        union {
            u32 array[];
            struct {
                u32 ext_id;
                u32 array[];
            };
            struct {
                u32 ext_time;
                u32 array[];
            };
            struct {
                u32 ext_time;
                u32 ext_id;
                u32 array[];
            };
        };
    };

    Therefore we can encode up to 15 event IDs into this compact
    representation (we reserve ID 0xF for extended id). If we assign those
    IDs per subbuffer, it leaves plenty of room before we have to write a
    supplementary field for more IDs.
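
    A reader-side decode sketch (helper name invented; the bit positions
    assume one particular compiler bitfield layout, which is itself part
    of the ABI question being debated here):

    static u32 decode_evid(const u32 *p)
    {
        u32 evid = (p[0] >> 1) & 0xF; /* time_ext:1 at bit 0, evid:4 next */

        if (evid == 0xF)              /* 0xF escapes to an extended ID */
            evid = p[1];              /* ext_id is the following word */
        return evid;
    }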

    OTOH, if we really want to have an event size in there (which is more
    solid), we could have :

    struct ring_buffer_event {
        u32 time_ext:1, time_lsb:31;
        u16 evid;
        u16 evsize;
        union {
            u32 array[];
            struct {
                u32 ext_time;
                u32 array[];
            };
        };
    };

    That's a bit bigger, but keeps the event size in the data stream.

    Also, nobody has explained successfully why we have to encode a time
    _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    either I fail to see the light here (I doubt it), or I am not clear
    enough when I say that we can just put the raw LSBs and compute the
    delta afterward when reading the buffers, manage to keep the same
    overflow detection power, and also keep the absolute value of the tsc
    lsb, which makes it much easier to cross-check than "deltas".
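
    A sketch of that reconstruction (helper name invented): keep the last
    full timestamp on the reader side and extend each 27-bit LSB value,
    assuming consecutive events are less than 2^27 cycles apart (otherwise
    the writer emits an extended-time event):

    static u64 extend_tsc(u64 prev_full, u32 lsb27)
    {
        u64 ts = (prev_full & ~((1ULL << 27) - 1)) | lsb27;

        if (ts < prev_full)      /* the 27 LSBs wrapped since last event */
            ts += 1ULL << 27;
        return ts;
    }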

    Now for the buffer pages implementation :


    +struct buffer_page {
    + union {
    + struct {
    + unsigned long flags; /* mandatory */
    + atomic_t _count; /* mandatory */
    + u64 time_stamp; /* page time stamp */

    Why should we ever associate a time stamp with a page ??

    I see that we could save the timestamp at which a subbuffer switch
    happens (which in this patchset semantics happens to be a page), but why
    would we ever want to save that in the page frame ? Especially if we
    simply write the LSBs instead of a time delta... Also, I would write
    this timestamp in a subbuffer _header_ which is exported to userspace,
    but I clearly don't see why we keep it here. In addition, it's
    completely wrong in a layered approach : if we want to switch from pages
    to video memory or to a statically allocated buffer at boot time, such
    "page-related" timestamp hacks won't do.

    + unsigned size; /* size of page data */

    This size should be exported to userspace, and therefore belongs to a
    subbuffer header, not to the page frame struct.

    + struct list_head list; /* linked list of free pages */

    "free pages" ? Are those "free" ? What does free mean here ? If Steven
    actually moves head and tail pointers to keep track of which pages data
    is written into, it will become an utter mess when we'll try to
    implement a lockless algorithm, since those are all non-atomic
    operations.

    + };
    + struct page page;
    + };
    +};

    > As for the page-spanning entries, I think we can do that with Steve's
    > system just fine, its just that Linus said its a dumb idea and Steve
    > dropped it, but there is nothing fundamental stopping us from recording
    > a length that is > PAGE_SIZE and copying data into the pages one at a
    > time.
    >
    > Nor do I see it impossible to implement splice on top of Steve's
    > ring-buffer..
    >
    > So again, why?
    >


    I'd like to separate the layer which deals with data read/write from the
    layer which deals with synchronization of data write/read to/from the
    buffers so we can eventually switch to a locking mechanism which
    provides a sane performance level, and given Steven's linked list
    implementation, it will just add unneeded locking requirements.

    Compared to this, I deal with buffer coherency with two 32-bit counters
    in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
    commit count). I'd like to keep this semantic and yet support writing to
    non-vmap'd memory (which I do in the patch I propose). I'd just have to
    implement the splice operation in ltt-relay.c (the layer that sits on
    top of this patch).
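
    (A sketch of the counter arithmetic, names invented: because both
    offsets are free-running u32 counters, fullness falls out of unsigned
    subtraction and the counters never need resetting.)

    static int enough_space(u32 write_off, u32 consumer_off,
                u32 buf_size, u32 len)
    {
        u32 used = write_off - consumer_off; /* valid across u32 wrap */

        return used + len <= buf_size;
    }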

    Mathieu

    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

  5. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
    > On Sat, 2008-09-27 at 19:10 +0200, Peter Zijlstra wrote:
    > > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:

    >
    > > > It does not provide _any_ sort of locking on buffer data. Locking should be done
    > > > by the caller. Given that we might think of very lightweight locking schemes,

    >
    > Which defeats the whole purpose of the exercise, we want to provide a
    > single mechanism - including locking - that is usable to all. Otherwise
    > everybody gets to do the hard part themselves, which will undoubtedly
    > result in many broken/suboptimal locking schemes.
    >


    Well, this is my answer to Steven's "this is too complex" comments,
    which I suspect is really a "this is too complex to implement". Sorry
    Steven, but you do not actually propose anything to address my concerns,
    which are : I want to export this data to userspace without tricky
    dependencies on the compiler ABI. I also don't want to be limited in
    locking infrastructure implementation.

    Those are the kind of concerns that are much easier to address in a
    layered and modular implementation. If we try to do everything in the
    same C file, we end up having typing/memory management/time management
    all closely tied.

    So I am all for providing a common infrastructure which implements all
    this, but I think this infrastructure should itself be layered and
    modular.

    Also, I have something really really near to the requirements expressed
    in LTTng, which is :

    Linux Kernel Markers : Event data typing exportable to userspace without
    tricky compiler ABI dependency.
    TODO : Export marker list to debugfs.
    Allow individual marker enable/disable
    through debugfs file.
    Use per client buffer marker IDs rather
    than a global ID table.
    Export the markers IDs/format/name through
    one small buffer for each client buffer.
    ltt-relay : Buffer coherency management. TODO : splice.
    ltt-relay-alloc : Buffer allocation and read/write, without vmap.
    ltt-tracer : In-kernel API to manage trace allocation,
    start/stop.
    TODO : Currently has a statically limited set of
    buffers. Should be extended so that clients could
    register new buffers.
    ltt-control : Netlink control which calls the in-kernel
    ltt-tracer API.
    TODO : switch from netlink to debugfs.
    ltt-timestamp : Timestamping infrastructure (tsc, global
    counter). Currently supports about 6
    architectures. Has an asm-generic fallback.
    ltt-heartbeat : Deal with 32-bit TSC overflow by periodically writing
    an event in every buffer.
    TODO : switch to "extended time" field by keeping
    track of the previously written timestamp.

    If you think it's worthwhile, I could post a selected set of my patches
    to LKML to see the reactions. However, note that there are a few TODOs,
    so it does not address all the requirements.

    Mathieu


    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

  6. Re: [RFC PATCH] LTTng relay buffer allocation, read, write


    On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
    > Ok, I'll try to explain my point of view. The thing is : I want those
    > binary buffers to be exported to userspace, and I fear that the approach
    > taken by Steven (let's write "simple" C structure directly into the
    > buffers) will in fact be much more _complex_ (due to subtle compiler
    > dependencies) than doing our own event payload (event data) format.
    >
    > Also, things such as
    > ring_buffer_lock: A way to lock the entire buffer.
    > ring_buffer_unlock: unlock the buffer.
    > will probably become a problem when trying to go for a more efficient
    > locking scheme.


    I plan on nuking the above for something better in v2.

    >
    > ring_buffer_peek: Look at a next item in the cpu buffer.
    > This kind of feature is useless for data extraction to userspace and
    > poses serious synchronization concerns if you have other writers in the
    > same buffer.


    It is absolutely important for ftrace.

    >
    > Structure for event records :
    >
    > struct ring_buffer_event {
    > u32 type:2, len:3, time_delta:27;
    > u32 array[];
    > };
    >
    > The only good thing about reserving 2 bits for event IDs in there is to
    > put the most frequent events in those IDs, which is clearly not the
    > case:
    > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
    > header telling the length of data written into the subbuffer (what you
    > guys call a "page", but what I still think might be worthy to be of
    > variable size, especially with a light locking infrastructure and
    > considering we might want to export this data to userspace).


    I now have both. But I think userspace can now easily see when there
    is a place in the buffer that is empty.

    >
    > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
    >
    > Also, if size _really_ matters, we should just export the event ID and
    > look up the event payload size through a separate table. If the payload
    > consists of a binary blob, then the data structure should start by a
    > payload size and then have the actual binary blob.
    >
    > struct ring_buffer_event {
    > u32 time_ext:1, evid:4, time_lsb:27;
    > union {
    > u32 array[];
    > struct {
    > u32 ext_id;
    > u32 array[];
    > };
    > struct {
    > u32 ext_time;
    > u32 array[];
    > };
    > struct {
    > u32 ext_time;
    > u32 ext_id;
    > u32 array[];
    > };
    > };
    >
    > Therefore we can encode up to 15 event IDs into this compact
    > representation (we reserve ID 0xF for extended id). If we assign those
    > IDs per subbuffer, it leaves plenty of room before we have to write a
    > supplementary field for more IDs.


    I wanted to push the event ids out of the ring buffer logic. Only a few
    internal ones are used. Otherwise, we'll have one hell of a big enum
    table with every freaking tracing type in it. That's what I want to avoid.


    >
    > OTOH, if we really want to have an event size in there (which is more
    > solid), we could have :
    >
    > struct ring_buffer_event {
    > u32 time_ext:1, time_lsb:31;
    > u16 evid;
    > u16 evsize;
    > union {
    > u32 array[];
    > struct {
    > u32 ext_time;
    > u32 array[];
    > };
    > };


    My smallest record is 8 bytes. Yours now is 12.

    >
    > That's a bit bigger, but keeps the event size in the data stream.


    So does mine.

    >
    > Also, nobody has explained successfully why we have to encode a time
    > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    > either I fail to see the light here (I doubt it), or I am not clear
    > enough when I say that we can just put the raw LSBs and compute the
    > delta afterward when reading the buffers, manage to keep the same
    > overflow detection power, and also keep the absolute value of the tsc
    > lsb, which makes it much easier to cross-check than "deltas".


    Well, you need to record wraps. Probably more often than you record delta
    wraps.

    >
    > Now for the buffer pages implementation :
    >
    >
    > +struct buffer_page {
    > + union {
    > + struct {
    > + unsigned long flags; /* mandatory */
    > + atomic_t _count; /* mandatory */
    > + u64 time_stamp; /* page time stamp */
    >
    > Why should we ever associate a time stamp with a page ??


    Because we save it on overwrite, which is the default mode for ftrace.

    >
    > I see that we could save the timestamp at which a subbuffer switch
    > happens (which in this patchset semantics happens to be a page), but why
    > would we ever want to save that in the page frame ? Especially if we
    > simply write the LSBs instead of a time delta... Also, I would write


    Where do you get the MSBs from?

    > this timestamp in a subbuffer _header_ which is exported to userspace,


    Well, our subbuffer header is the page frame.

    > but I clearly don't see why we keep it here. In addition, it's
    > completely wrong in a layered approach : if we want to switch from pages
    > to video memory or to a statically allocated buffer at boot time, such
    > "page-related" timestamp hacks won't do.


    As Linus said, anything bigger than a page should be outside this buffer.
    All the buffer would then need is a pointer to the data. Then the tracer
    can figure out what to do with the rest of that.

    >
    > + unsigned size; /* size of page data */
    >
    > This size should be exported to userspace, and therefore belongs to a
    > subbuffer header, not to the page frame struct.


    Again, page frame == sub buffer, period!

    >
    > + struct list_head list; /* linked list of free pages */
    >
    > "free pages" ? Are those "free" ? What does free mean here ? If Steven


    Bah, thanks. "free" is a misnomer. It should just be list of pages.

    > actually moves head and tail pointers to keep track of which pages data
    > is written into, it will become an utter mess when we'll try to
    > implement a lockless algorithm, since those are all non-atomic
    > operations.


    cmpxchg(head, old_head, head->next) ?

    >
    > + };
    > + struct page page;
    > + };
    > +};
    >
    > > As for the page-spanning entries, I think we can do that with Steve's
    > > system just fine, its just that Linus said its a dumb idea and Steve
    > > dropped it, but there is nothing fundamental stopping us from recording
    > > a length that is > PAGE_SIZE and copying data into the pages one at a
    > > time.
    > >
    > > Nor do I see it impossible to implement splice on top of Steve's
    > > ring-buffer..
    > >
    > > So again, why?
    > >

    >
    > I'd like to separate the layer which deals with data read/write from the
    > layer which deals with synchronization of data write/read to/from the
    > buffers so we can eventually switch to a locking mechanism which
    > provides a sane performance level, and given Steven's linked list
    > implementation, it will just add unneeded locking requirements.
    >
    > Compared to this, I deal with buffer coherency with two 32 bits counters
    > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
    > commit count). I'd like to keep this semantic and yet support writing to
    > non-vmap'd memory (which I do in the patch I propose). I'd just have to
    > implement the splice operation in ltt-relay.c (the layer that sits on
    > top of this patch).
    >


    -- Steve


  7. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    On Mon, 2008-09-29 at 11:50 -0400, Mathieu Desnoyers wrote:
    > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
    > > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
    > > > As I told Martin, I was thinking about taking an axe and moving stuff around in
    > > > relay. Which I just did.
    > > >
    > > > This patch reimplements relay with a linked list of pages. Provides read/write
    > > > wrappers which should be used to read or write from the buffers. It's the core
    > > > of a layered approach to the design requirements expressed by Martin and
    > > > discussed earlier.
    > > >
    > > > It does not provide _any_ sort of locking on buffer data. Locking should be done
    > > > by the caller. Given that we might think of very lightweight locking schemes, it
    > > > makes sense to me that the underlying buffering infrastructure supports event
    > > > records larger than 1 page.
    > > >
    > > > A cache saving 3 pointers is used to keep track of current page used for the
    > > > buffer for write, read and current subbuffer header pointer lookup. The offset
    > > > of each page within the buffer is saved in the page frame structure to permit
    > > > cache lookup without extra locking.
    > > >
    > > > TODO : Currently, no splice file operations are implemented. Should come soon.
    > > > The idea is to splice the buffers directly into files or to the network.
    > > >
    > > > This patch is released as early RFC. It builds, that's about it. Testing will
    > > > come when I implement the splice ops.

    > >
    > > Why? What aspects of Steve's ring-buffer interface will hinder us
    > > optimizing the implementation to be as light-weight as you like?
    > >
    > > The thing is, I'd like to see it that light as well ;-)
    > >

    >
    > Ok, I'll try to explain my point of view. The thing is : I want those
    > binary buffers to be exported to userspace, and I fear that the approach
    > taken by Steven (let's write "simple" C structure directly into the
    > buffers) will in fact be much more _complex_ (due to subtle compiler
    > dependencies) than doing our own event payload (event data) format.


    The only compiler dependant thing in there is the bitfield, which could
    be recoded using regular bitops.

    I'm not seeing anything particularly worrisome about that.
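
    For instance (a sketch; the masks are invented to match the type:2,
    len:3, time_delta:27 layout quoted above), the header could be packed
    with explicit shifts so the layout no longer depends on the compiler:

    #define RB_TYPE_MASK   0x3
    #define RB_LEN_SHIFT   2
    #define RB_LEN_MASK    (0x7 << RB_LEN_SHIFT)
    #define RB_DELTA_SHIFT 5

    static inline u32 rb_encode(u32 type, u32 len, u32 delta)
    {
        return (type & RB_TYPE_MASK) |
               ((len << RB_LEN_SHIFT) & RB_LEN_MASK) |
               (delta << RB_DELTA_SHIFT);
    }

    static inline u32 rb_delta(u32 header)
    {
        return header >> RB_DELTA_SHIFT; /* upper 27 bits hold the delta */
    }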

    > Also, things such as
    > ring_buffer_lock: A way to lock the entire buffer.
    > ring_buffer_unlock: unlock the buffer.
    > will probably become a problem when trying to go for a more efficient
    > locking scheme.


    You can do

    stop:
        cpu_buffer->stop = 1;
        smp_wmb();
        sched_sync();

    start:
        smp_wmb();
        cpu_buffer->stop = 0;

    write:
        if (unlikely((smp_rmb(), cpu_buffer->stop))) {
            return -EBUSY;
            /* or: */
            while (cpu_buffer->stop)
                cpu_relax();
        }

    The read in the fast path is just a read, nothing fancy...

    > ring_buffer_peek: Look at a next item in the cpu buffer.
    > This kind of feature is useless for data extraction to userspace and
    > poses serious synchronization concerns if you have other writers in the
    > same buffer.


    Sure, its probably possible to rework the merge-iterator to use consume,
    but that would require it having storage for 1 event, which might be a
    bit messy.

    How would your locking deal with this? - it really is a requirement to
    be able to merge-sort-iterate the output from kernel space..

    > Structure for event records :
    >
    > struct ring_buffer_event {
    > u32 type:2, len:3, time_delta:27;
    > u32 array[];
    > };
    >
    > The only good thing about reserving 2 bits for event IDs in there is to
    > put the most frequent events in those IDs


    Not so, these types are buffer package types, not actual event types, I
    think that's a useful distinction.

    > , which is clearly not the
    > case:
    > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
    > header telling the length of data written into the subbuffer (what you
    > guys call a "page", but what I still think might be worthy to be of
    > variable size, especially with a light locking infrastructure and
    > considering we might want to export this data to userspace).


    But then you'd need a sub-buffer header, which in itself takes space,
    this padding solution seems like a fine middle ground, it only takes
    space when you need it and is free otherwise.

    The sub-buffer headers would always take space.

    > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
    >
    > Also, if size _really_ matters,


    You and Martin have been telling it does ;-)

    > we should just export the event ID and
    > look up the event payload size through a separate table. If the payload
    > consists of a binary blob, then the data structure should start by a
    > payload size and then have the actual binary blob.
    >
    > struct ring_buffer_event {
    > u32 time_ext:1, evid:4, time_lsb:27;
    > union {
    > u32 array[];
    > struct {
    > u32 ext_id;
    > u32 array[];
    > };
    > struct {
    > u32 ext_time;
    > u32 array[];
    > };
    > struct {
    > u32 ext_time;
    > u32 ext_id;
    > u32 array[];
    > };
    > };
    >
    > Therefore we can encode up to 15 event IDs into this compact
    > representation (we reserve ID 0xF for extended id). If we assign those
    > IDs per subbuffer, it leaves plenty of room before we have to write a
    > supplementary field for more IDs.
    >
    > OTOH, if we really want to have an event size in there (which is more
    > solid), we could have :
    >
    > struct ring_buffer_event {
    > u32 time_ext:1, time_lsb:31;
    > u16 evid;
    > u16 evsize;
    > union {
    > u32 array[];
    > struct {
    > u32 ext_time;
    > u32 array[];
    > };
    > };
    >
    > That's a bit bigger, but keeps the event size in the data stream.


    I'm really not seeing what any of these proposals have on top of what
    Steve currently has. We have the ability to encode up to 28 bytes of
    payload in a 4 bytes header, which should suffice for most entries,
    right?

    > Also, nobody has explained successfully why we have to encode a time
    > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    > either I fail to see the light here (I doubt it), or I am not clear
    > enough when I say that we can just put the raw LSBs and compute the
    > delta afterward when reading the buffers, manage to keep the same
    > overflow detection power, and also keep the absolute value of the tsc
    > lsb, which makes it much easier to cross-check than "deltas".


    Still not quite understanding where you get the MSBs from, how do you
    tell if two LSBs are from the same period?

    > Now for the buffer pages implementation :
    >
    >
    > +struct buffer_page {
    > + union {
    > + struct {
    > + unsigned long flags; /* mandatory */
    > + atomic_t _count; /* mandatory */
    > + u64 time_stamp; /* page time stamp */
    >
    > Why should we ever associate a time stamp with a page ??


    Discarding the whole sub-buffer idea, it could be used to validate
    whichever time-stamp logic.

    Personally I don't particularly like the sub-buffer concept, and I don't
    think we need it.

    > I see that we could save the timestamp at which a subbuffer switch
    > happens (which in this patchset semantics happens to be a page), but why
    > would we ever want to save that in the page frame ? Especially if we
    > simply write the LSBs instead of a time delta... Also, I would write
    > this timestamp in a subbuffer _header_ which is exported to userspace,


    Why have sub-buffers at all?

    > but I clearly don't see why we keep it here. In addition, it's
    > completely wrong in a layered approach : if we want to switch from pages
    > to video memory or to a statically allocated buffer at boot time, such
    > "page-related" timestamp hacks won't do.


    I don't think you _ever_ want to insert actual video memory in the
    trace, what you _might_ possibly want to do, is insert a copy of a
    frame, but that you can do with paged buffers like we have, just add an
    entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
    to copy everything to page-aligned chunks.

    There is nothing technically prohibiting this in Steve's scheme, except
    for Linus telling you you're an idiot for doing it.

    The NOP packets you dislike allow us to align regular entries with page
    boundaries, which in turn allows this 0-copy stuff. If you use the
    sub-write iterator stuff we talked about a while ago, you can leave out
    the NOPs.

    Having both makes sense (if you want the large packets stuff), use the
    regular page aligned 0-copy stuff for regular small packets, and use the
    sub-write iterator stuff for your huge packets.

    > > As for the page-spanning entries, I think we can do that with Steve's
    > > system just fine, its just that Linus said its a dumb idea and Steve
    > > dropped it, but there is nothing fundamental stopping us from recording
    > > a length that is > PAGE_SIZE and copying data into the pages one at a
    > > time.
    > >
    > > Nor do I see it impossible to implement splice on top of Steve's
    > > ring-buffer..
    > >
    > > So again, why?
    > >

    >
    > I'd like to separate the layer which deals with data read/write from the
    > layer which deals with synchronization of data write/read to/from the
    > buffers so we can eventually switch to a locking mechanism which
    > provides a sane performance level, and given Steven's linked list
    > implementation, it will just add unneeded locking requirements.


    How does it differ from your linked list implementation? Realistically,
    we want but a single locking scheme for the trace buffer stuff, so
    factoring it out doesn't really make sense.

    > Compared to this, I deal with buffer coherency with two 32-bit counters
    > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
    > commit count).


    I think that can work just fine on top of Steve's stuff too, it needs a
    bit of trimming etc.. but afaict there isn't anything stopping us from
    implementing his reserve function as light as you want:

    int reserve(buffer, size, flags)
    {
            preempt_disable();
            cpu = smp_processor_id();
            cpu_buf = buffer->buffers[cpu];

            if (cpu_buf->stop) {
                    ret = -EBUSY;
                    goto out;
            }

    again:
            pos = cpu_buf->write_pos;
            if (flags & CONTIG) {
                    new_pos = pos + size;
            } else {
                    if (size > PAGE_SIZE) {
                            ret = -ENOSPC;
                            goto out;
                    }
                    if (((pos + size) & PAGE_MASK) != (pos & PAGE_MASK))
                            insert_nops();
                    new_pos = pos + size;
            }
            if (cmpxchg(&cpu_buf->write_pos, pos, new_pos) != pos)
                    goto again;
            return pos;     /* preemption stays disabled until commit */

    out:
            preempt_enable();
            return ret;
    }

    If you put the offset&PAGE_MASK in each page-frame you can use that to
    easily detect when you need to flip to the next page.

    Which I imagine is similar to what you have... although I must admit to
    not being sure how to deal with over-write here, I guess your buffer
    must be large enough to ensure no nesting depth allows you to
    wrap-around while having an event un-committed.


  8. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Steven Rostedt (rostedt@goodmis.org) wrote:
    >
    > On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
    > > Ok, I'll try to explain my point of view. The thing is : I want those
    > > binary buffers to be exported to userspace, and I fear that the approach
    > > taken by Steven (let's write "simple" C structure directly into the
    > > buffers) will in fact be much more _complex_ (due to subtle compiler
    > > dependencies) than doing our own event payload (event data) format.
    > >
    > > Also, things such as
    > > ring_buffer_lock: A way to lock the entire buffer.
    > > ring_buffer_unlock: unlock the buffer.
    > > will probably become a problem when trying to go for a more efficient
    > > locking scheme.

    >
    > I plan on nuking the above for something better in v2.
    >
    > >
    > > ring_buffer_peek: Look at a next item in the cpu buffer.
    > > This kind of feature is useless for data extraction to userspace and
    > > poses serious synchronization concerns if you have other writers in the
    > > same buffer.

    >
    > It's absolutely important for ftrace.
    >


    As I already told you in person, if you have, say, 16 pages for your
    buffer, you could peek into all the pages which are not being currently
    written into (the 15 other pages). This would permit removing unnecessary
    writer synchronization from a cmpxchg scheme : it would only have to
    push the readers upon page switch rather than at every single event.
    Pushing the readers involves, at best, a fully synchronized cmpxchg,
    which I would prefer not to do at each and every event.

    > >
    > > Structure for event records :
    > >
    > > struct ring_buffer_event {
    > >         u32 type:2, len:3, time_delta:27;
    > >         u32 array[];
    > > };
    > >
    > > The only good thing about reserving 2 bits for event IDs in there is to
    > > put the most frequent events in those IDs, which is clearly not the
    > > case:
    > > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
    > > header telling the length of data written into the subbuffer (what you
    > > guys call a "page", but what I still think might be worthy to be of
    > > variable size, especially with a light locking infrastructure and
    > > considering we might want to export this data to userspace).

    >
    > I now have both. But I think userspace can now easily see when there
    > is a place in the buffer that is empty.
    >


    (I notice the comment in your v10 says that minimum size for event is 8
    bytes, isn't it rather 4 bytes ?)

    (len field set to zero for events bigger than 28 bytes.. what do you use
    for events with 0 byte payload ? I'd rather use the 28 bytes value to
    identify all events >= 28 bytes and then use the first field to identify
    the length).

    What you propose here is to duplicate the information : have the number
    of bytes used in the page header, exported to userspace, and also to
    reserve an event ID (which becomes unavailable to encode anything else)
    for "padding" event. There is clearly a space loss here.

    Also, dealing with end of page padding will become a bit complex : you
    have to check whether your padding event fits in the space left at the
    end (4-bytes aligned, but minimum event size is 8 bytes, as stated in
    your comment ? Is this true ?) Also, what happens if you plan to write
    an event bigger than 28 bytes in your subbuffer (or page) and happen to
    be at the end of the page ? You'd have to create padding which is larger
    than 28 bytes. Your v10 comments seem to indicate the design does not
    support it.

    > >
    > > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
    > >
    > > Also, if size _really_ matters, we should just export the event ID and
    > > look up the event payload size through a separate table. If the payload
    > > consists of a binary blob, then the data structure should start by a
    > > payload size and then have the actual binary blob.
    > >
    > > struct ring_buffer_event {
    > >         u32 time_ext:1, evid:4, time_lsb:27;
    > >         union {
    > >                 u32 array[];
    > >                 struct {
    > >                         u32 ext_id;
    > >                         u32 array[];
    > >                 };
    > >                 struct {
    > >                         u32 ext_time;
    > >                         u32 array[];
    > >                 };
    > >                 struct {
    > >                         u32 ext_time;
    > >                         u32 ext_id;
    > >                         u32 array[];
    > >                 };
    > >         };
    > > };
    > >
    > > Therefore we can encode up to 15 event IDs into this compact
    > > representation (we reserve ID 0xF for extended id). If we assign those
    > > IDs per subbuffer, it leaves plenty of room before we have to write a
    > > supplementary field for more IDs.

    >
    > I wanted to push the event ids out of the ring buffer logic. Only a few
    > internal ones are used. Otherwise, we'll have one hell of a bit enum
    > table with every freaking tracing type in it. That's what I want to avoid.
    >


    This is why I propose to dynamically allocate event IDs through the
    markers infrastructure. Otherwise you'll have to declare and export such
    large enumerations by hand.

    >
    > >
    > > OTOH, if we really want to have an event size in there (which is more
    > > solid), we could have :
    > >
    > > struct ring_buffer_event {
    > >         u32 time_ext:1, time_lsb:31;
    > >         u16 evid;
    > >         u16 evsize;
    > >         union {
    > >                 u32 array[];
    > >                 struct {
    > >                         u32 ext_time;
    > >                         u32 array[];
    > >                 };
    > >         };
    > > };

    >
    > My smallest record is 8 bytes. Yours now is 12.
    >


    Uh ?

    The smallest record here is 8 bytes too. It has time_ext bit, time_lsb,
    evid and evsize. It does not contain any event-specific payload.

    With this given implementation :

    struct ring_buffer_event {
            u32 time_ext:1, evid:4, time_lsb:27;
    };

    The smallest event is 4 bytes, which is half the size of yours.


    > > That's a bit bigger, but keeps the event size in the data stream.

    >
    > So does mine.
    >


    I see the first 4 bytes of this smallest event (the ring_buffer_event).
    What do the following 4 bytes contain exactly ?


    > >
    > > Also, nobody has explained successfully why we have to encode a time
    > > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    > > either I fail to see the light here (I doubt it), or I am not clear
    > > enough when I say that we can just put the raw LSBs and compute the
    > > delta afterward when reading the buffers, manage to keep the same
    > > overflow detection power, and also keep the absolute value of the tsc
    > > lsb, which makes it much easier to cross-check than "deltas".

    >
    > Well, you need to record wraps. Probably more often then you record delta
    > wraps.
    >


    Nope. You don't even need to record wraps. Say you have the kernel code
    that checks for 27-bit overflows between two consecutive events and
    writes an extended timestamp if this is detected
    (that is, if (evBtsc - evAtsc > 0x7FFFFFF)).
    Therefore, you are sure that if you have two consecutive events with
    only 27 bits TSC lsb in the trace (non-extended timestamp), those are at
    most 2^27 cycles apart. Let's call them event A and event B.

    Event A would have this TSC value (only 27 LSBs are saved here) :
    0x100
    Event B would have this TSC LSB value :
    0x200

    In this case, 0x200 - 0x100 > 0
    -> no overflow

    However, if we have :
    Event A : 0x2
    Event B : 0x1

    Then we know that there has been exactly one overflow between those two
    events :
    0x1 - 0x2 <= 0
    -> overflow

    And the extreme case :
    Event A : 0x10
    Event B : 0x10
    0x10 - 0x10 <= 0
    -> overflow

    Notice that if event B, in the last case, had been just one cycle later,
    the check initially done by the kernel would have switched to an
    extended timestamp, so we would have detected the overflow anyway.
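
    A minimal reader-side sketch of this reconstruction (the helper name is
    hypothetical, not from any posted patch), assuming the writer guarantees
    that two consecutive compact events are at most 2^27 cycles apart :

    #include <stdint.h>

    #define TSC_LSB_BITS    27
    #define TSC_LSB_MASK    ((1ULL << TSC_LSB_BITS) - 1)

    /*
     * Rebuild the full TSC of an event from its saved 27 LSBs and the
     * fully-known TSC of the previous event. A backward (or equal) step
     * in the LSBs can only mean exactly one wrap, since the writer would
     * have emitted an extended timestamp for anything further apart.
     */
    static uint64_t tsc_from_lsb(uint64_t prev_tsc, uint32_t lsb)
    {
            uint64_t tsc = (prev_tsc & ~TSC_LSB_MASK) | lsb;

            if (lsb <= (uint32_t)(prev_tsc & TSC_LSB_MASK))
                    tsc += 1ULL << TSC_LSB_BITS;    /* one overflow */
            return tsc;
    }

    With the examples above, a step from 0x100 to 0x200 keeps the same MSBs,
    while 0x2 to 0x1 (or 0x10 to 0x10) bumps them by one period.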

    > >
    > > Now for the buffer pages implementation :
    > >
    > >
    > > +struct buffer_page {
    > > +        union {
    > > +                struct {
    > > +                        unsigned long flags;    /* mandatory */
    > > +                        atomic_t _count;        /* mandatory */
    > > +                        u64 time_stamp;         /* page time stamp */
    > >
    > > Why should we ever associate a time stamp with a page ??

    >
    > Because we save it on overwrite, which is the default mode for ftrace.
    >


    Sorry, I don't understand this one. You "save it" (where ?) on overwrite
    (when ? When you start overwriting a previously populated page ?) and
    this serves what purpose exactly ?

    > >
    > > I see that we could save the timestamp at which a subbuffer switch
    > > happens (which in this patchset semantics happens to be a page), but why
    > > would we every want to save that in the page frame ? Especially if we
    > > simply write the LSBs instead of a time delta... Also, I would write

    >
    > Where do you get the MSBs from?
    >


    Just to try to grasp the page frame concept : are those actually bytes
    physically located at the beginning of the page (and thus actually
    representing a page header) or are they placed elsewhere in the kernel
    memory and just serves as memory-management data for pages ?

    I doubt this struct page is actually part of the page itself, and thus I
    don't see how it exports the TSC MSBs to userspace. I would propose to
    keep a few bytes at the beginning of every subbuffer (actually pages in
    your implementation) to save the full TSC.

    > > this timestamp in a subbuffer _header_ which is exported to userspace,

    >
    > Well, our subbuffer header is the page frame.
    >


    Hrm, this is actually my question above.

    > > but I clearly don't see why we keep it here. In addition, it's
    > > completely wrong in a layered approach : if we want to switch from pages
    > > to video memory or to a statically allocated buffer at boot time, such
    > > "page-related" timestamp hacks won't do.

    >
    > As Linus said, anything bigger than a page should be outside this buffer.
    > All the buffer would then need is a pointer to the data. Then the tracer
    > can figure out what to do with the rest of that.
    >


    That will lead to memory leaks in flight recorder mode. What happens if
    the data is not consumed ? We never free the referenced memory ?

    Also, how do we present that to userspace ?

    Also, if we have a corrupted buffer, lost events ? Those will call into
    kfree in the tracing hot path ? This sounds _very_ intrusive and
    dangerous. I am quite sure we'll want to instrument the memory
    management parts of the kernel too (I actually have patches in LTTng for
    that and kmemtrace already does it).

    > >
    > > + unsigned size; /* size of page data */
    > >
    > > This size should be exported to userspace, and therefore belongs to a
    > > subbuffer header, not to the page frame struct.

    >
    > Again, page frame == sub buffer, period!
    >


    The best argument I have heard yet is from Linus and is based on the
    fact that we'll never want to write more than 4096 bytes within locking
    such as irq disable and spinlock. This assumption does not hold with a
    cmpxchg-based locking scheme.

    Also, I kind of sensed that people were unwilling to write large events
    in the same stream where small events are put. There is no need to do
    that. We can have various buffers for the various tracers (one for
    scheduler, one for ftrace, one for network packet sniffing, one for
    usbmon...) and therefore manage to keep large events in separate
    buffers.

    > >
    > > + struct list_head list; /* linked list of free pages */
    > >
    > > "free pages" ? Are those "free" ? What does free mean here ? If Steven

    >
    > Bah, thanks. "free" is a misnomer. It should just be list of pages.
    >
    > > actually moves head and tail pointers to keep track of which pages data
    > > is written into, it will become an utter mess when we'll try to
    > > implement a lockless algorithm, since those are all non-atomic
    > > operations.

    >
    > cmpxchg(head, old_head, head->next) ?
    >


    Nope, that won't work because you'll have to cmpxchg two things :
    - The offset within the page
    - The current page pointer

    And you cannot do both at the same time, and therefore any of those two
    can fail. Dealing with page switch will therefore become racy by design.

    Mathieu

    > >
    > > + };
    > > + struct page page;
    > > + };
    > > +};
    > >
    > > > As for the page-spanning entries, I think we can do that with Steve's
    > > > system just fine, its just that Linus said its a dumb idea and Steve
    > > > dropped it, but there is nothing fundamental stopping us from recording
    > > > a length that is > PAGE_SIZE and copying data into the pages one at a
    > > > time.
    > > >
    > > > Nor do I see it impossible to implement splice on top of Steve's
    > > > ring-buffer..
    > > >
    > > > So again, why?
    > > >

    > >
    > > I'd like to separate the layer which deals with data read/write from the
    > > layer which deals with synchronization of data write/read to/from the
    > > buffers so we can eventually switch to a locking mechanism which
    > > provides a sane performance level, and given Steven's linked list
    > > implementation, it will just add unneeded locking requirements.
    > >
    > > Compared to this, I deal with buffer coherency with two 32-bit counters
    > > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
    > > commit count). I'd like to keep this semantic and yet support writing to
    > > non-vmap'd memory (which I do in the patch I propose). I'd just have to
    > > implement the splice operation in ltt-relay.c (the layer that sits on
    > > top of this patch).
    > >

    >
    > -- Steve
    >


    --
    Mathieu Desnoyers

  9. Re: [RFC PATCH] LTTng relay buffer allocation, read, write


    On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
    > >
    > > >
    > > > ring_buffer_peek: Look at a next item in the cpu buffer.
    > > > This kind of feature is useless for data extraction to userspace and
    > > > poses serious synchronization concerns if you have other writers in the
    > > > same buffer.

    > >
    > > It's absolutely important for ftrace.
    > >

    >
    > As I already told you in person, if you have, say, 16 pages for your
    > buffer, you could peek into all the pages which are not being currently
    > written into (the 15 other pages). This would permit removing unnecessary
    > writer synchronization from a cmpxchg scheme : it would only have to
    > push the readers upon page switch rather than at every single event.
    > Pushing the readers involves, at best, a fully synchronized cmpxchg,
    > which I would prefer not to do at each and every event.


    Actually, I also have been looking at doing things lockless. Note, the
    peek operation is for the iteration mode of read, which must disable
    writes (for now). The iteration mode does not consume what it reads; in fact
    there is no consumer. It is used to give a trace after the fact. No writer
    should be present anyway.

    The peek operation will not be used for a consumer read, which allows for
    both readers and writers to run at the same time.

    > >
    > > I now have both. But I think userspace can now easily see when there
    > > is a place in the buffer that is empty.
    > >

    >
    > (I notice the comment in your v10 says that minimum size for event is 8
    > bytes, isn't it rather 4 bytes ?)


    Yes it is 8 bytes minimum, 4 bytes aligned.

    >
    > (len field set to zero for events bigger than 28 bytes.. what do you use
    > for events with 0 byte payload ? I'd rather use the 28 bytes value to
    > identify all events >= 28 bytes and then use the first field to identify
    > the length).


    I force an event with a zero-byte payload to have a len of 1. This
    keeps the minimum at 8 bytes, with 4 bytes unused.

    >
    > What you propose here is to duplicate the information : have the number
    > of bytes used in the page header, exported to userspace, and also to
    > reserve an event ID (which becomes unavailable to encode anything else)
    > for "padding" event. There is clearly a space loss here.


    Because my focus is not on userspace. I don't care about user space apps!

    My user space app is "cat". I don't want to map every single kind of event
    into the buffer. Keeping the length in the header keeps everything more
    robust. I don't have to worry about searching an event hash table, or
    remapping what events are. Hell, I don't want to always register an event; I
    might just say, load this data and be done with it.

    The user shouldn't need to always keep track of event sizes on reading.
    Remember, the user could be something in the kernel space as well.

    >
    > Also, dealing with end of page padding will become a bit complex : you
    > have to check whether your padding event fits in the space left at the
    > end (4-bytes aligned, but minimum event size is 8 bytes, as stated in
    > your comment ? Is this true ?) Also, what happens if you plan to write
    > an event bigger than 28 bytes in your subbuffer (or page) and happen to
    > be at the end of the page ? You'd have to create padding which is larger
    > than 28 bytes. Your v10 comments seem to indicate the design does not
    > support it.


    Yep, we add more than 28 bytes of padding. Linus said this was fine. If
    you don't have enough space to fit your event, go to the next page.
    If you only have 4 bytes left, yes we have an exception to the rule, the
    padding event will be 4 bytes.
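
    As a sketch of that rule (a hypothetical helper, not the v10 code) :
    the padding is simply whatever remains on the page, whether or not it
    exceeds 28 bytes, with the 4-byte exception noted above.

    #define RB_PAGE_SIZE    4096U

    /*
     * If an event of 'len' bytes does not fit at offset 'off' within the
     * page, return the number of padding bytes needed to reach the next
     * page (possibly > 28, possibly exactly 4); 0 means the event fits.
     */
    static unsigned int rb_pad_len(unsigned int off, unsigned int len)
    {
            if (off + len <= RB_PAGE_SIZE)
                    return 0;               /* event fits, no padding */
            return RB_PAGE_SIZE - off;      /* pad to end of page */
    }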

    > > > Therefore we can encode up to 15 event IDs into this compact
    > > > representation (we reserve ID 0xF for extended id). If we assign those
    > > > IDs per subbuffer, it leaves plenty of room before we have to write a
    > > > supplementary field for more IDs.

    > >
    > > I wanted to push the event ids out of the ring buffer logic. Only a few
    > > internal ones are used. Otherwise, we'll have one hell of a bit enum
    > > table with every freaking tracing type in it. That's what I want to avoid.
    > >

    >
    > This is why I propose to dynamically allocate event IDs through the
    > markers infrastructure. Otherwise you'll have to declare and export such
    > large enumerations by hand.


    I don't want to be dependent on markers. No need to be, it's too much for
    what I need.

    > >
    > > >
    > > > OTOH, if we really want to have an event size in there (which is more
    > > > solid), we could have :
    > > >
    > > > struct ring_buffer_event {
    > > >         u32 time_ext:1, time_lsb:31;
    > > >         u16 evid;
    > > >         u16 evsize;
    > > >         union {
    > > >                 u32 array[];
    > > >                 struct {
    > > >                         u32 ext_time;
    > > >                         u32 array[];
    > > >                 };
    > > >         };
    > > > };

    > >
    > > My smallest record is 8 bytes. Yours now is 12.
    > >

    >
    > Uh ?
    >
    > The smallest record here is 8 bytes too. It has time_ext bit, time_lsb,
    > evid and evsize. It does not contain any event-specific payload.


    OK, my smallest event with data is 8 bytes, yours is 12. What do I want a
    payload without data for?

    I see:

    4 bytes time_ext, time
    2 bytes evid
    2 bytes evsize

    There's your 8 bytes. You need another 4 bytes to hold data, which makes it
    12.


    >
    > With this given implementation :
    >
    > struct ring_buffer_event {
    >         u32 time_ext:1, evid:4, time_lsb:27;
    > };
    >
    > The smallest event is 4-bytes, which is twice smaller than yours.


    With no data. I'm not going to worry about saving 4 bytes for a zero
    payload.

    >
    >
    > > > That's a bit bigger, but keeps the event size in the data stream.

    > >
    > > So does mine.
    > >

    >
    > I see the 4 first bytes of this smallest event (the ring_buffer_event).
    > What does the following 4 bytes contain exactly ?


    The payload.

    >
    >
    > > >
    > > > Also, nobody has explained successfully why we have to encode a time
    > > > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    > > > either I fail to see the light here (I doubt it), or I am not clear
    > > > enough when I say that we can just put the raw LSBs and compute the
    > > > delta afterward when reading the buffers, manage to keep the same
    > > > overflow detection power, and also keep the absolute value of the tsc
    > > > lsb, which makes it much easier to cross-check than "deltas".

    > >
    > > Well, you need to record wraps. Probably more often then you record delta
    > > wraps.
    > >

    >
    > Nope. You don't even need to record wraps. Say you have the kernel code
    > that checks for 27-bit overflows between two consecutive events and
    > writes an extended timestamp if this is detected
    > (that is, if (evBtsc - evAtsc > 0x7FFFFFF)).
    > Therefore, you are sure that if you have two consecutive events with
    > only 27 bits TSC lsb in the trace (non-extended timestamp), those are at
    > most 2^27 cycles apart. Let's call them event A and event B.
    >
    > Event A would have this TSC value (only 27 LSBs are saved here) :
    > 0x100
    > Event B would have this TSC LSB value :
    > 0x200
    >
    > In this case, 0x200 - 0x100 > 0
    > -> no overflow
    >
    > However, if we have :
    > Event A : 0x2
    > Event B : 0x1
    >
    > Then we know that there has been exactly one overflow between those two
    > events :
    > 0x1 - 0x2 <= 0
    > -> overflow
    >
    > And the extreme case :
    > Event A : 0x10
    > Event B : 0x10
    > 0x10 - 0x10 <= 0
    > -> overflow
    >
    > Notice that if event B, in the last case, had been just one cycle later,
    > the check initially done by the kernel would have switched to an
    > extended timestamp, so we would have detected the overflow anyway.


    I actually don't care which method we use. This is something that we can
    change later.

    >
    > > >
    > > > Now for the buffer pages implementation :
    > > >
    > > >
    > > > +struct buffer_page {
    > > > +        union {
    > > > +                struct {
    > > > +                        unsigned long flags;    /* mandatory */
    > > > +                        atomic_t _count;        /* mandatory */
    > > > +                        u64 time_stamp;         /* page time stamp */
    > > >
    > > > Why should we ever associate a time stamp with a page ??

    > >
    > > Because we save it on overwrite, which is the default mode for ftrace.
    > >

    >
    > Sorry, I don't understand this one. You "save it" (where ?) on overwrite
    > (when ? When you start overwriting a previously populated page ?) and
    > this serves what purpose exactly ?


    Actually, when we write to a new page, we save the timestamp. We don't
    care about pages we overwrite, since the next read page still has the
    timestamp value that was recorded when it was first written to.

    I'm saying, every page needs a timestamp value that can be retrieved;
    otherwise, how do you retrieve the timestamp of a page that was
    overwritten several times since the trace was started and never read?

    >
    > > >
    > > > I see that we could save the timestamp at which a subbuffer switch
    > > > happens (which in this patchset semantics happens to be a page), but why
    > > would we ever want to save that in the page frame ? Especially if we
    > > > simply write the LSBs instead of a time delta... Also, I would write

    > >
    > > Where do you get the MSBs from?
    > >

    >
    > Just to try to grasp the page frame concept : are those actually bytes
    > physically located at the beginning of the page (and thus actually
    > representing a page header) or are they placed elsewhere in the kernel
    > memory and just serves as memory-management data for pages ?


    The latter. Yes, the struct page is the page-frame management
    infrastructure. When the page is allocated somewhere with
    get_free_page, most of that structure is not being used. When the page
    is free or being used as cache, then that structure matters.

    >
    > I doubt this struct page is actually part of the page itself, and thus I
    > don't see how it exports the TSC MSBs to userspace. I would propose to
    > keep a few bytes at the beginning of every subbuffer (actually pages in
    > your implementation) to save the full TSC.


    Actually, my first versions had that. But I moved it. I'll come up with
    a way to get this info later.

    >
    > > > this timestamp in a subbuffer _header_ which is exported to userspace,

    > >
    > > Well, our subbuffer header is the page frame.
    > >

    >
    > Hrm, this is actually my question above.


    hehe, answered above.

    >
    > > > but I clearly don't see why we keep it here. In addition, it's
    > > > completely wrong in a layered approach : if we want to switch from pages
    > > > to video memory or to a statically allocated buffer at boot time, such
    > > > "page-related" timestamp hacks won't do.

    > >
    > > As Linus said, anything bigger than a page should be outside this buffer.
    > > All the buffer would then need is a pointer to the data. Then the tracer
    > > can figure out what to do with the rest of that.
    > >

    >
    > That will lead to memory leaks in flight recorder mode. What happens if
    > the data is not consumed ? We never free the referenced memory ?


    IIRC, Linus had for these records:

    array[0] = pointer to big data
    array[1] = pointer to free function.

    When the data is consumed in overwrite mode, we can call a free function
    to get rid of it.
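
    As a sketch of such a record (field names illustrative, not from any
    posted patch) :

    /*
     * Out-of-line "big event" reference : the ring buffer stores only a
     * pointer to the data plus a destructor, so overwrite mode can
     * release the memory when the entry is discarded.
     */
    struct rb_big_ref {
            void *data;                     /* array[0] : the big data */
            void (*free)(void *data);       /* array[1] : free function */
    };

    /* Called when the entry is consumed or overwritten. */
    static void rb_big_ref_release(struct rb_big_ref *ref)
    {
            if (ref->free)
                    ref->free(ref->data);
    }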

    >
    > Also, how do we present that to userspace ?


    The tracer layer will make something available for userspace. The ring
    buffer layer will not.

    >
    > Also, if we have a corrupted buffer, lost events ? Those will call into
    > kfree in the tracing hot path ? This sounds _very_ intrusive and
    > dangerous. I am quite sure we'll want to instrument the memory
    > management parts of the kernel too (I actually have patches in LTTng for
    > that and kmemtrace already does it).


    We could add it to a per_cpu list that can free these later.

    >
    > > >
    > > > + unsigned size; /* size of page data */
    > > >
    > > > This size should be exported to userspace, and therefore belongs to a
    > > > subbuffer header, not to the page frame struct.

    > >
    > > Again, page frame == sub buffer, period!
    > >

    >
    > The best argument I have heard yet is from Linus and is based on the
    > fact that we'll never want to write more than 4096 bytes within locking
    > such as irq disable and spinlock. This assumption does not hold with a
    > cmpxchg-based locking scheme.
    >
    > Also, I kind of sensed that people were unwilling to write large events
    > in the same stream where small events are put. There is no need to do
    > that. We can have various buffers for the various tracers (one for
    > scheduler, one for ftrace, one for network packet sniffing, one for
    > usbmon...) and therefore manage to keep large events in separate
    > buffers.


    Make your tracer layer have a "large event" buffer. This current code
    will focus on little events.

    >
    > > >
    > > > + struct list_head list; /* linked list of free pages */
    > > >
    > > > "free pages" ? Are those "free" ? What does free mean here ? If Steven

    > >
    > > Bah, thanks. "free" is a misnomer. It should just be list of pages.
    > >
    > > > actually moves head and tail pointers to keep track of which pages data
    > > > is written into, it will become an utter mess when we'll try to
    > > > implement a lockless algorithm, since those are all non-atomic
    > > > operations.

    > >
    > > cmpxchg(head, old_head, head->next) ?
    > >

    >
    > Nope, that won't work because you'll have to cmpxchg two things :
    > - The offset within the page
    > - The current page pointer


    BTW, I remember telling you that I want the reader and writer to loop
    on the same page. I've changed my mind. This code will push the reader
    to the next page on overwrite. That is, a writer will not write on a page
    that a reader is on (unless the reader is behind it).

    Some assumptions that I plan on making.

    1) A writer can only write to the CPU buffer it is on.
    2) All readers must be synchronized with each other. Two readers can
    not read from the same buffer at the same time (at the API level).

    Now, what this gives us:

    1) stack order of writes.

    All interrupts and NMIs will enter and leave in stack fashion.
    Writers will always disable preemption.
    We do not need to worry about SMP for writes.
    That is, you reserve data by atomically pushing the head forward.
    If the head goes past the end of the page, you first check whether
    your start of head (head - length) is still on the page. If it is,
    you add your padding (you already reserved it), then you atomically
    push the head forward, and then you start the process again.
    If the start of the head (head - length) is not on the page, that
    means that an IRQ or NMI came in and pushed it past before you.


    2) Only need to deal with one reader at a time.

    The one reader just helps make the reader side easier. It doesn't make
    much difference with the writer side.

    I've been thinking of various ways to handle the reader.

    1) Only consumer mode handles synchronization with writers.
    The iterator mode (read a trace but do not consume) must disable
    writes. This just makes everything easier, and is usually what you
    want, otherwise the read is just corrupted. ftrace currently disables
    recording when the trace is being read (except for the consumer
    trace_pipe).

    2) Have an extra sub buffer. This might be what you do. That is, have
    an extra page, and when a reader starts, it will swap the page from
    the buffer with the extra page. The writer will then write on this
    new page.

    We could use atomic inc on the pages to detect if the page changed
    since we copied it, and if we fail try again.

    This will also allow you to take these extra pages and push them to a
    file.
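
    As a sketch of that swap (field names hypothetical), it can be a
    single cmpxchg on the reader-page pointer :

    /*
     * The reader owns one spare page and atomically trades it for the
     * page it wants to read. On success the old page is private to the
     * reader and can be copied or spliced to a file at leisure.
     */
    static struct buffer_page *
    reader_swap_page(struct ring_buffer_per_cpu *cpu_buf,
                     struct buffer_page *spare)
    {
            struct buffer_page *page = cpu_buf->reader_page;

            if (cmpxchg(&cpu_buf->reader_page, page, spare) != page)
                    return NULL;    /* the page changed under us; retry */
            return page;            /* now owned by the reader */
    }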

    This is just the gist of things to come in v2.

    -- Steve

    >
    > And you cannot do both at the same time, and therefore any of those two
    > can fail. Dealing with page switch will therefore become racy by design.
    >


  10. Re: [RFC PATCH] LTTng relay buffer allocation, read, write


    On Mon, 29 Sep 2008, Steven Rostedt wrote:
    > We do not need to worry about SMP for writes.
    > That is, you reserve data by atomically pushing the head forward.
    > If the head goes past the end of the page, you first check whether
    > your start of head (head - length) is still on the page. If it is,
    > you add your padding (you already reserved it), then you atomically
    > push the head forward, and then you start the process again.
    > If the start of the head (head - length) is not on the page, that
    > means that an IRQ or NMI came in and pushed it past before you.


    I forgot to mention one important detail. The "head" index will stay
    on the page frame. That way we do not need to figure out which
    head_page we are on. We grab the head_page, and we atomically increment
    (local_inc_return) the head index on that page. If the return value is
    still on the page, we succeeded. We can also increment a value on this
    page frame that will prevent recording if we somehow overflowed the buffer
    before relinquishing the stack.

    That is:

        reserve_event()
        IRQ->
            reserve_event();
            [...]
            IRQ-> reserve_event() came back to the original head!

    Here we do not have a big enough buffer, and this is just stupid ;-)
    We would drop packets in this case, and should drop the guy on his
    head who came up with the too small buffer.
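
    As a sketch of that per-page reservation (names illustrative;
    preemption is assumed disabled, so only same-CPU IRQ/NMI writers can
    nest) :

    /*
     * Reserve 'len' bytes on the current page by atomically advancing a
     * local_t write index kept in the page frame. Nested IRQ/NMI writers
     * on the same CPU serialize through local_add_return().
     */
    static void *rb_reserve_on_page(struct buffer_page *bpage, unsigned len)
    {
            unsigned tail = local_add_return(len, &bpage->write) - len;

            if (tail + len <= PAGE_SIZE)
                    return page_address(&bpage->page) + tail;

            /*
             * We crossed the end of the page. If 'tail' is still on the
             * page, we own the padding and must switch pages; otherwise
             * a nested writer already pushed past the end before us.
             */
            return NULL;    /* caller pads (if tail < PAGE_SIZE) and retries */
    }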

    -- Steve


  11. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
    > On Mon, 2008-09-29 at 11:50 -0400, Mathieu Desnoyers wrote:
    > > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
    > > > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
    > > > > As I told Martin, I was thinking about taking an axe and moving stuff around in
    > > > > relay. Which I just did.
    > > > >
    > > > > This patch reimplements relay with a linked list of pages. Provides read/write
    > > > > wrappers which should be used to read or write from the buffers. It's the core
    > > > > of a layered approach to the design requirements expressed by Martin and
    > > > > discussed earlier.
    > > > >
    > > > > It does not provide _any_ sort of locking on buffer data. Locking should be done
    > > > > by the caller. Given that we might think of very lightweight locking schemes, it
    > > > > makes sense to me that the underlying buffering infrastructure supports event
    > > > > records larger than 1 page.
    > > > >
    > > > > A cache saving 3 pointers is used to keep track of current page used for the
    > > > > buffer for write, read and current subbuffer header pointer lookup. The offset
    > > > > of each page within the buffer is saved in the page frame structure to permit
    > > > > cache lookup without extra locking.
    > > > >
    > > > > TODO : Currently, no splice file operations are implemented. Should come soon.
    > > > > The idea is to splice the buffers directly into files or to the network.
    > > > >
    > > > > This patch is released as early RFC. It builds, that's about it. Testing will
    > > > > come when I implement the splice ops.
    > > >
    > > > Why? What aspects of Steve's ring-buffer interface will hinder us
    > > > optimizing the implementation to be as light-weight as you like?
    > > >
    > > > The thing is, I'd like to see it that light as well ;-)
    > > >

    > >
    > > Ok, I'll try to explain my point of view. The thing is : I want those
    > > binary buffers to be exported to userspace, and I fear that the approach
    > > taken by Steven (let's write "simple" C structure directly into the
    > > buffers) will in fact be much more _complex_ (due to subtle compiler
    > > dependencies) than doing our own event payload (event data) format.

    >
    > The only compiler-dependent thing in there is the bitfield, which could
    > be recoded using regular bitops.
    >
    > I'm not seeing anything particularly worrisome about that.
    >



    +struct ring_buffer_event {
    +        u32 type:2, len:3, time_delta:27;
    +        u32 array[];
    +};

    I am mostly talking about what goes in the array[] part of the event.
    This will be where the complex typing will occur.


    > > Also, things such as
    > > ring_buffer_lock: A way to lock the entire buffer.
    > > ring_buffer_unlock: unlock the buffer.
    > > will probably become a problem when trying to go for a more efficient
    > > locking scheme.

    >
    > You can do
    >
    > stop:
    >         cpu_buffer->stop = 1;
    >         smp_wmb();
    >         sched_sync();
    >
    > start:
    >         smp_wmb();
    >         cpu_buffer->stop = 0;
    >
    >
    > write:
    >         if (unlikely((smp_rmb(), cpu_buffer->stop))) {
    >                 return -EBUSY;
    >         or
    >                 while (cpu_buffer->stop)
    >                         cpu_relax();
    >         }
    >
    > The read in the fast path is just a read, nothing fancy...
    >


    That will stop tracing if you want to read a trace at the same time you
    want to write it, which does not permit continuous streaming.

    I agree that such start/stop is needed to control tracing on/off, but it
    should be separate from the actual writer/reader locking.

    > > ring_buffer_peek: Look at a next item in the cpu buffer.
    > > This kind of feature is useless for data extraction to userspace and
    > > poses serious synchronization concerns if you have other writers in the
    > > same buffer.

    >
    > Sure, it's probably possible to rework the merge-iterator to use consume,
    > but that would require it having storage for 1 event, which might be a
    > bit messy.
    >


    Not sure why you'd need storage for 1 event ?

    > How would your locking deal with this? - it really is a requirement to
    > be able to merge-sort-iterate the output from kernel space..
    >


    Sure, I plan to support this. This would be a subset of userspace data
    extraction. The in-kernel consumer would be a consumer just like a
    splice operation which would extract buffers one event at a time. Rather
    than doing a splice to write the data to disk or to the network, this
    special in-kernel consumer would merge-sort events from the various
    buffers it is connected to and pretty-print one event record at a time
    to a seq_file.

    Instead of peeking at the "next event", which implies closely coupled
    locking with the writer, it would therefore have the ability to consume
    all the oldest subbuffers which are not being written to.

    Being a standard consumer, it would therefore increment the consumer
    offset, which is synchronized by the writer at each subbuffer (page)
    switch only.
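
    As a sketch, the merge-sort itself only needs each per-buffer iterator
    to expose the timestamp of its oldest unconsumed event (the iterator
    API below is hypothetical) :

    /*
     * Repeatedly pick the buffer whose oldest unconsumed event has the
     * smallest timestamp and pretty-print it to the seq_file.
     */
    static void merge_print(struct seq_file *m, struct buf_iter *iter, int n)
    {
            for (;;) {
                    struct buf_iter *oldest = NULL;
                    int i;

                    for (i = 0; i < n; i++) {
                            if (!iter_has_event(&iter[i]))
                                    continue;
                            if (!oldest ||
                                iter_ts(&iter[i]) < iter_ts(oldest))
                                    oldest = &iter[i];
                    }
                    if (!oldest)
                            break;          /* all buffers drained */
                    print_event(m, iter_consume(oldest));
            }
    }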

    > > Structure for event records :
    > >
    > > struct ring_buffer_event {
    > >         u32 type:2, len:3, time_delta:27;
    > >         u32 array[];
    > > };
    > >
    > > The only good thing about reserving 2 bits for event IDs in there is to
    > > put the most frequent events in those IDs

    >
    > Not so, these types are buffer package types, not actual event types, I
    > think that's a useful distinction.
    >


    I actually think we only need 1 bit to represent extended timestamp
    headers. The rest of these event types are just unneeded and consume
    precious ID space for nothing. I have not seen any justification telling
    why reserving these events actually does something that cannot be done
    by the extended timestamp header or by reserving a small header at the
    beginning of the subbuffer (page).

    > > , which is clearly not the
    > > case:
    > > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
    > > header telling the length of data written into the subbuffer (what you
    > > guys call a "page", but what I still think might be worthy to be of
    > > variable size, especially with a light locking infrastructure and
    > > considering we might want to export this data to userspace).

    >
    > But then you'd need a sub-buffer header, which in itself takes space,
    > this padding solution seems like a fine middle-ground, it only takes
    > space when you need it and is free otherwise.
    >
    > The sub-buffer headers would always take space.
    >


    Note that you already need to save the timestamp in the subbuffer
    header. The question is : should we also save the unused size.

    If we don't, keeping a padding event will likely :
    - Rarely save space; given the variable-size nature of event records,
    the likelihood of needing padding is pretty high. A padding event is
    bigger than just writing the unused size anyway, given the unused size
    can be written in a few bits for a page, but the padding event
    requires at least 8 bytes.
    - Add complexity to the buffer structure; we have to make sure there is
    enough room in the page to write the padding event, have to make sure
    we support large padding events (> 28 bytes)...
    - It also reserves an event ID, which could be used for other purposes.
    That means if we have 8 event IDs we wish to write in a specific
    buffer (let's suppose they have a 0-byte payload), we therefore need
    to increase the event size from 4 bytes to 8 bytes just because we
    reserved those IDs for "internal" (useless) purposes. OTOH, if we keep
    those IDs free to encode tracer-specific IDs, we can pack a lot of
    data in 4-byte events rather than 8-byte ones, which actually doubles
    the amount of events we can record.


    > > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
    > >
    > > Also, if size _really_ matters,

    >
    > You and Martin have been telling it does ;-)
    >
    > > we should just export the event ID and
    > > look up the event payload size through a separate table. If the payload
    > > consists of a binary blob, then the data structure should start by a
    > > payload size and then have the actual binary blob.
    > >
    > > struct ring_buffer_event {
    > >         u32 time_ext:1, evid:4, time_lsb:27;
    > >         union {
    > >                 u32 array[];
    > >                 struct {
    > >                         u32 ext_id;
    > >                         u32 array[];
    > >                 };
    > >                 struct {
    > >                         u32 ext_time;
    > >                         u32 array[];
    > >                 };
    > >                 struct {
    > >                         u32 ext_time;
    > >                         u32 ext_id;
    > >                         u32 array[];
    > >                 };
    > >         };
    > > };
    > >
    > > Therefore we can encode up to 15 event IDs into this compact
    > > representation (we reserve ID 0xF for extended id). If we assign those
    > > IDs per subbuffer, it leaves plenty of room before we have to write a
    > > supplementary field for more IDs.
    > >
    > > OTOH, if we really want to have an event size in there (which is more
    > > solid), we could have :
    > >
    > > struct ring_buffer_event {
    > >         u32 time_ext:1, time_lsb:31;
    > >         u16 evid;
    > >         u16 evsize;
    > >         union {
    > >                 u32 array[];
    > >                 struct {
    > >                         u32 ext_time;
    > >                         u32 array[];
    > >                 };
    > >         };
    > > };
    > >
    > > That's a bit bigger, but keeps the event size in the data stream.

    >
    > I'm really not seeing what any of these proposals offer on top of what
    > Steve currently has. We have the ability to encode up to 28 bytes of
    > payload with a 4-byte header, which should suffice for most entries,
    > right?
    >


    The common entries would be under 28 bytes, right. But the thing is that
    Steven's proposal uses 8 bytes for the _smallest_ event when we can in
    fact cut that down to 4 bytes.


    > > Also, nobody has explained successfully why we have to encode a time
    > > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    > > either I fail to see the light here (I doubt it), or I am not clear
    > > enough when I say that we can just put the raw LSBs and compute the
    > > delta afterward when reading the buffers, manage to keep the same
    > > overflow detection power, and also keep the absolute value of the tsc
    > > lsb, which makes it much easier to cross-check than "deltas".

    >
    > Still not quite understanding where you get the MSBs from, how do you
    > tell if two LSBs are from the same period?
    >


    See my answer to Steven for this.

    > > Now for the buffer pages implementation :
    > >
    > >
    > > +struct buffer_page {
    > > +        union {
    > > +                struct {
    > > +                        unsigned long flags;    /* mandatory */
    > > +                        atomic_t _count;        /* mandatory */
    > > +                        u64 time_stamp;         /* page time stamp */
    > >
    > > Why should we ever associate a time stamp with a page ??

    >
    > Discarding the whole sub-buffer idea, it could be used to validate
    > whichever time-stamp logic.
    >
    > Personally I don't particularly like the sub-buffer concept, and I don't
    > think we need it.
    >


    Depends on the use-cases I guess, and on the locking used. My question
    about this time-stamp is : shouldn't we, instead, put it in a page
    header (exposed to userspace) rather than in this page frame, which I
    believe is in a separate data structure (and not in the page per se) ?

    > > I see that we could save the timestamp at which a subbuffer switch
    > > happens (which in this patchset semantics happens to be a page), but why
    > > would we ever want to save that in the page frame ? Especially if we
    > > simply write the LSBs instead of a time delta... Also, I would write
    > > this timestamp in a subbuffer _header_ which is exported to userspace,

    >
    > Why have sub-buffers at all?
    >


    To support events larger than 4kB. Also, to support different backing
    storage for the buffers without keeping a limitation specific to a single
    page-tied implementation of those.

    > > but I clearly don't see why we keep it here. In addition, it's
    > > completely wrong in a layered approach : if we want to switch from pages
    > > to video memory or to a statically allocated buffer at boot time, such
    > > "page-related" timestamp hacks won't do.

    >
    > I don't think you _ever_ want to insert actual video memory in the
    > trace; what you _might_ possibly want to do is insert a copy of a
    > frame, but that you can do with paged buffers like we have, just add an
    > entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
    > to copy everything to page-aligned chunks.
    >


    No, no. What I meant is : some people would like to _reserve_ part of
    their video card memory so they can use it as backing storage for the
    _buffers_. The benefit of doing that is that those buffers will survive
    a hot reboot. Rather useful to debug kernel crashes.

    Copying larger event payloads would actually require reserving multiple
    events in the trace, and would require a cookie or some locking to tell
    that a given event is actually the follow-up of the previous one,
    because with more evolved locking schemes, each event space reservation
    is atomic and we can therefore not assume that a single payload split
    into two events will be put contiguously in the trace buffers (it can be
    interrupted, preempted...).

    > There is nothing technically prohibiting this in Steve's scheme, except
    > for Linus telling you you're an idiot for doing it.
    >
    > The NOP packets you dislike allow us to align regular entries with page
    > boundaries, which in turn allows this 0-copy stuff. If you use the
    > sub-write iterator stuff we talked about a while ago, you can leave out
    > the NOPs.
    >
    > Having both makes sense (if you want the large packets stuff), use the
    > regular page aligned 0-copy stuff for regular small packets, and use the
    > sub-write iterator stuff for your huge packets.
    >


    Sorry, I don't understand what the NOP packet gives you that the used
    bytes amount in the subbuffer (page) header doesn't provide (given my
    explanation above).


    > > > As for the page-spanning entries, I think we can do that with Steve's
    > > > system just fine, its just that Linus said its a dumb idea and Steve
    > > > dropped it, but there is nothing fundamental stopping us from recording
    > > > a length that is > PAGE_SIZE and copying data into the pages one at a
    > > > time.
    > > >
    > > > Nor do I see it impossible to implement splice on top of Steve's
    > > > ring-buffer..
    > > >
    > > > So again, why?
    > > >

    > >
    > > I'd like to separate the layer which deals with data read/write from the
    > > layer which deals with synchronization of data write/read to/from the
    > > buffers so we can eventually switch to a locking mechanism which
    > > provides a sane performance level, and given Steven's linked list
    > > implementation, it will just add unneeded locking requirements.

    >
    > How does it differ from your linked list implementation? Realistically,
    > we want but a single locking scheme for the trace buffer stuff, so
    > factoring it out doesn't really make sense.
    >


    My linked list implementation does not assume we have to do a "disable
    interrupts" around the whole pointer manipulation, simply because there
    isn't any complicated manipulation done. Actually, the only linked list
    pointer manipulation I need to do is to atomically save one of three
    pointers to cache the current page I am writing to/reading from/getting
    the header pointer from.
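
    As a sketch of that pointer cache (field names illustrative, in the
    spirit of this patch rather than its exact code) :

    /*
     * One pointer per use, each updated with a single cmpxchg, so no
     * interrupt-disabled region is needed around the list traversal.
     * Losing the race is harmless : the winner stored an equally valid
     * page pointer.
     */
    struct rchan_page_cache {
            struct buffer_page *write_page;     /* current write page */
            struct buffer_page *read_page;      /* current read page */
            struct buffer_page *header_page;    /* current subbuffer header */
    };

    static void cache_update(struct buffer_page **slot,
                             struct buffer_page *old, struct buffer_page *new)
    {
            cmpxchg(slot, old, new);    /* failure is fine, see above */
    }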

    > > Compared to this, I deal with buffer coherency with two 32-bit counters
    > > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
    > > commit count).

    >
    > I think that can work just fine on top of Steve's stuff too, it needs a
    > bit of trimming etc.. but afaict there isn't anything stopping us from
    > implementing his reserve function as light as you want:
    >
    > int reserve(buffer, size, flags)
    > {
    >         preempt_disable();
    >         cpu = smp_processor_id();
    >         cpu_buf = buffer->buffers[cpu];
    >
    >         if (cpu_buf->stop) {
    >                 ret = -EBUSY;
    >                 goto out;
    >         }
    >
    > again:
    >         pos = cpu_buf->write_pos;
    >         if (flags & CONTIG) {
    >                 new_pos = pos + size;
    >         } else {
    >                 if (size > PAGE_SIZE) {
    >                         ret = -ENOSPC;
    >                         goto out;
    >                 }
    >                 if (((pos + size) & PAGE_MASK) != (pos & PAGE_MASK))
    >                         insert_nops();
    >                 new_pos = pos + size;
    >         }
    >         if (cmpxchg(&cpu_buf->write_pos, pos, new_pos) != pos)


    How do you deal with his call to __rb_reserve_next which deals with page
    change and plays with tail_page pointer ? This is where the whole
    problem lies; you have to atomically update more than a single atomic
    variable.

    >                 goto again;
    >         return pos;     /* preemption stays disabled until commit */
    >
    > out:
    >         preempt_enable();
    >         return ret;
    > }
    >
    > If you put the offset&PAGE_MASK in each page-frame you can use that to
    > easily detect when you need to flip to the next page.
    >


    Exactly what I have in this patch.

    > Which I imagine is similar to what you have... although I must admit to
    > not being sure how to deal with over-write here, I guess your buffer
    > must be large enough to ensure no nesting depth allows you to
    > wrap-around while having an event un-committed.
    >


    I deal with overwrite by looking at the commit count value of the
    subbuffer (page) I am about to start writing into. If it's not a
    multiple of the subbuffer (page) size, then I consider the content of
    that specific subbuffer as corrupted. And yes, it implies that nesting
    that wraps around the number of subbuffers would cause (detected) data
    corruption. If it ever happens, the corrupted subbuffers count will be
    incremented and the given subbuffer will be overwritten (the subbuffer
    commit count is re-equilibrated at this moment). When the stale writer
    resumes its execution, it will increment the commit count and
    therefore cause a second subbuffer corruption. In the end, 2 subbuffers
    will be dropped if this kind of situation happens. But I guess the rule
    is to make sure nested writes do not overflow the complete buffer
    size, or to simply make the buffers large enough.
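
    As a sketch, the check itself is just (names illustrative) :

    /*
     * The commit count of a subbuffer advances by exactly subbuf_size
     * bytes per completed pass, so any remainder means a nested writer
     * reserved space but never committed it : treat the content as
     * corrupted and account for it before overwriting.
     */
    static int subbuf_corrupted(unsigned long commit_count,
                                unsigned long subbuf_size)
    {
            return (commit_count % subbuf_size) != 0;
    }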

    Mathieu


    --
    Mathieu Desnoyers

  12. Re: [RFC PATCH] LTTng relay buffer allocation, read, write



    On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
    > > >
    > > > Ok, I'll try to explain my point of view. The thing is : I want those
    > > > binary buffers to be exported to userspace, and I fear that the approach
    > > > taken by Steven (let's write "simple" C structure directly into the
    > > > buffers) will in fact be much more _complex_ (due to subtle compiler
    > > > dependencies) than doing our own event payload (event data) format.

    > >
    > > The only compiler-dependent thing in there is the bitfield, which could
    > > be recoded using regular bitops.
    > >
    > > I'm not seeing anything particularly worrisome about that.
    > >

    >
    >
    > +struct ring_buffer_event {
    > +        u32 type:2, len:3, time_delta:27;
    > +        u32 array[];
    > +};
    >
    > I am mostly talking about what goes in the array[] part of the event.
    > This will be where the complex typing will occur.


    Only on the reader side.

    >
    >
    > > > Also, things such as
    > > > ring_buffer_lock: A way to lock the entire buffer.
    > > > ring_buffer_unlock: unlock the buffer.
    > > > will probably become a problem when trying to go for a more efficient
    > > > locking scheme.

    > >
    > > You can do
    > >
    > > stop:
    > >         cpu_buffer->stop = 1;
    > >         smp_wmb();
    > >         sched_sync();
    > >
    > > start:
    > >         smp_wmb();
    > >         cpu_buffer->stop = 0;
    > >
    > >
    > > write:
    > >         if (unlikely((smp_rmb(), cpu_buffer->stop))) {
    > >                 return -EBUSY;
    > >         or
    > >                 while (cpu_buffer->stop)
    > >                         cpu_relax();
    > >         }
    > >
    > > The read in the fast path is just a read, nothing fancy...
    > >

    >
    > That will stop tracing if you want to read a trace at the same time you
    > want to write it, which does not permit continuous streaming.


    Agree, we can do better than that.

    >
    > I agree that such start/stop is needed to control tracing on/off, but it
    > should be separate from the actual writer/reader locking.


    yep

    >
    > > > ring_buffer_peek: Look at a next item in the cpu buffer.
    > > > This kind of feature is useless for data extraction to userspace and
    > > > poses serious synchronization concerns if you have other writers in the
    > > > same buffer.

    > >
    > > Sure, it's probably possible to rework the merge-iterator to use consume,
    > > but that would require it having storage for 1 event, which might be a
    > > bit messy.
    > >

    >
    > Not sure why you'd need storage for 1 event ?
    >
    > > How would your locking deal with this? - it really is a requirement to
    > > be able to merge-sort-iterate the output from kernel space..
    > >

    >
    > Sure, I plan to support this. This would be a subset of userspace data
    > extraction. The in-kernel consumer would be a consumer just like a
    > splice operation which would extract buffers one event at a time. Rather
    > than doing a splice to write the data to disk or to the network, this
    > special in-kernel consumer would merge-sort events from the various
    > buffers it is connected to and pretty-print one event record at a time
    > to a seq_file.
    >
    > Instead of peeking at the "next event", which implies closely coupled
    > locking with the writer, it would therefore have the ability to consume
    > all the oldest subbuffers which are not being written to.
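
    As a sketch of that merge-sort consumer, with hypothetical helpers
    (oldest_event_ts(), consume_event()) standing in for whatever the real
    consumer API would provide:

        struct buf;             /* per-cpu buffer, details elided */
        struct seq_file;

        /* Hypothetical helpers: peek the timestamp of the oldest
         * unconsumed event, and consume + pretty-print one event. */
        bool oldest_event_ts(struct buf *b, u64 *ts);
        int consume_event(struct seq_file *m, struct buf *b);

        /*
         * Pick, among all per-cpu buffers, the one whose oldest
         * unconsumed event has the smallest timestamp, and print that
         * single event to the seq_file. Returns 0 when all are empty.
         */
        static int merge_print_one(struct seq_file *m, struct buf **bufs,
                                   int nr_bufs)
        {
                u64 ts, min_ts = 0;
                int i, min = -1;

                for (i = 0; i < nr_bufs; i++) {
                        if (!oldest_event_ts(bufs[i], &ts))
                                continue;       /* buffer empty */
                        if (min < 0 || ts < min_ts) {
                                min_ts = ts;
                                min = i;
                        }
                }
                if (min < 0)
                        return 0;
                return consume_event(m, bufs[min]);
        }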


    Remember, I plan on supporting two modes: an iterator mode and a
    consumer/producer mode. Do you have an iterator mode? That's what most
    of the ftrace tools use.

    The iterator mode will disable writing and use the "peek" mode. The
    consumer mode will not, and will have its own way to read, as the
    buffer code does right now.

    >
    > Being a standard consumer, it would therefore increment the consumer
    > offset, which is synchronized by the writer at each subbuffer (page)
    > switch only.
    >
    > > > Structure for event records :
    > > >
    > > > struct ring_buffer_event {
    > > > u32 type:2, len:3, time_delta:27;
    > > > u32 array[];
    > > > };
    > > >
    > > > The only good thing about reserving 2 bits for event IDs in there is to
    > > > put the most frequent events in those IDs

    > >
    > > Not so, these types are buffer package types, not actual event types; I
    > > think that's a useful distinction.
    > >

    >
    > I actually think we only need 1 bit to represent extended timestamp
    > headers. The rest of these event types are just unneeded and consume
    > precious ID space for nothing. I have not seen any justification for
    > why reserving these events actually does something that cannot be done
    > by the extended timestamp header or by reserving a small header at the
    > beginning of the subbuffer (page).


    Perhaps we can look at this for v2.

    >
    > > > , which is clearly not the
    > > > case:
    > > > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
    > > > header telling the length of data written into the subbuffer (what you
    > > > guys call a "page", but what I still think might be worth making
    > > > variable-sized, especially with a light locking infrastructure and
    > > > considering we might want to export this data to userspace).

    > >
    > > But then you'd need a sub-buffer header, which in itself takes space;
    > > this padding solution seems like a fine middle-ground, it only takes
    > > space when you need it and is free otherwise.
    > >
    > > The sub-buffer headers would always take space.
    > >

    >
    > Note that you already need to save the timestamp in the subbuffer
    > header. The question is : should we also save the unused size.


    I do that too. Yes I have duplicate info. But I still like the robustness
    of the padding event.

    >
    > If we don't, keeping a padding event will likely:
    > - Rarely save space; given the variable-size nature of event records,
    > the likelihood of needing padding is pretty high. A padding event is
    > bigger than just writing the unused size anyway, given the unused size
    > can be written in a few bits for a page, but the padding event
    > requires at least 8 bytes.


    False. It is a "special" event that can actually fit in 4 bytes, which
    is what would be wasted anyway. Length is ignored by the padding event.
    It's basically just another way to make sure we are not reading useless
    data.

    From the rb_event_length function:

    + switch (event->type) {
    + case RINGBUF_TYPE_PADDING:
    + /* undefined */
    + return -1;
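
    To illustrate, a sketch of how a 4-byte padding event can close out a
    page, assuming the event header layout quoted earlier (type:2, len:3,
    time_delta:27); the RINGBUF_TYPE_PADDING value is picked arbitrarily
    here:

        struct ring_buffer_event {
                u32 type:2, len:3, time_delta:27;
                u32 array[];
        };

        #define RINGBUF_TYPE_PADDING 3  /* value chosen for illustration */

        /*
         * Fill the tail of a page with one padding event. Only the 4-byte
         * header is written; len and time_delta are ignored for padding,
         * so the event implicitly covers whatever bytes remain.
         */
        static void pad_page_tail(void *page, unsigned int tail)
        {
                struct ring_buffer_event *event;

                if (tail + sizeof(*event) > PAGE_SIZE)
                        return;         /* under 4 bytes left: nothing fits */
                event = page + tail;
                event->type = RINGBUF_TYPE_PADDING;
        }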


    > - Add complexity to the buffer structure; we have to make sure there is
    > enough room in the page to write the padding event, and have to make
    > sure we support large padding events (> 28 bytes)...


    False as described above.

    > - It also reserves an event ID, which could be used for other purposes.
    > That means if we have 8 event IDs we wish to write in a specific
    > buffer (let's suppose they have a 0-byte payload), we therefore need
    > to increase the event size from 4 bytes to 8 bytes just because we
    > reserved those IDs for "internal" (useless) purposes. OTOH, if we keep
    > those IDs free to encode tracer-specific IDs, we can pack a lot of
    > data in 4-byte events rather than 8-byte ones, which actually doubles
    > the amount of events we can record.


    A 4 byte dataless payload is useless anyway.

    >
    >
    > > > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
    > > >
    > > > Also, if size _really_ matters,

    > >
    > > You and Martin have been telling me it does ;-)
    > >
    > > > we should just export the event ID and
    > > > look up the event payload size through a separate table. If the payload
    > > > consists of a binary blob, then the data structure should start with a
    > > > payload size followed by the actual binary blob.
    > > >
    > > > struct ring_buffer_event {
    > > >         u32 time_ext:1, evid:4, time_lsb:27;
    > > >         union {
    > > >                 u32 array[];
    > > >                 struct {
    > > >                         u32 ext_id;
    > > >                         u32 array[];
    > > >                 };
    > > >                 struct {
    > > >                         u32 ext_time;
    > > >                         u32 array[];
    > > >                 };
    > > >                 struct {
    > > >                         u32 ext_time;
    > > >                         u32 ext_id;
    > > >                         u32 array[];
    > > >                 };
    > > >         };
    > > > };
    > > >
    > > > Therefore we can encode up to 15 event IDs into this compact
    > > > representation (we reserve ID 0xF for extended id). If we assign those
    > > > IDs per subbuffer, it leaves plenty of room before we have to write a
    > > > supplementary field for more IDs.
    > > >
    > > > OTOH, if we really want to have an event size in there (which is more
    > > > solid), we could have :
    > > >
    > > > struct ring_buffer_event {
    > > >         u32 time_ext:1, time_lsb:31;
    > > >         u16 evid;
    > > >         u16 evsize;
    > > >         union {
    > > >                 u32 array[];
    > > >                 struct {
    > > >                         u32 ext_time;
    > > >                         u32 array[];
    > > >                 };
    > > >         };
    > > > };
    > > >
    > > > That's a bit bigger, but keeps the event size in the data stream.
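
    For reference, a sketch of how a reader could decode the first
    (compact) layout above: evid 0xF escapes to a full 32-bit ID, time_ext
    flags an extended timestamp word, and the extended timestamp precedes
    the extended ID, as in the combined variant (bitfield layout assumed
    to match the writer's):

        struct compact_header {
                u32 time_ext:1, evid:4, time_lsb:27;
        };

        /* Decode one header; returns a pointer to the event payload. */
        static const u32 *decode_event(const u32 *p, u32 *id, u64 *tsc_lsb)
        {
                const struct compact_header *h = (const void *)p;

                p++;                    /* skip the 32-bit header */
                *id = h->evid;
                *tsc_lsb = h->time_lsb;
                if (h->time_ext)        /* extended timestamp word follows */
                        *tsc_lsb |= (u64)*p++ << 27;
                if (*id == 0xF)         /* extended ID word follows */
                        *id = *p++;
                return p;               /* now points at the payload */
        }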

    > >
    > > I'm really not seeing what any of these proposals offer on top of what
    > > Steve currently has. We have the ability to encode up to 28 bytes of
    > > payload on top of a 4-byte header, which should suffice for most
    > > entries, right?
    > >

    >
    > The common entries would be under 28 bytes, right. But the thing is that
    > Steven's proposal uses 8 bytes for the _smallest_ event when we can in
    > fact cut that down to 4 bytes.


    Only for dataless events.

    >
    >
    > > > Also, nobody has successfully explained why we have to encode a time
    > > > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
    > > > either I fail to see the light here (I doubt it), or I am not clear
    > > > enough when I say that we can just put the raw LSBs and compute the
    > > > delta afterward when reading the buffers, manage to keep the same
    > > > overflow-detection power, and also keep the absolute value of the tsc
    > > > LSB, which makes it much easier to cross-check than "deltas".

    > >
    > > Still not quite understanding where you get the MSBs from; how do you
    > > tell if two LSBs are from the same period?
    > >

    >
    > See my answer to Steven for this.
    >
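
    A sketch of the raw-LSB scheme under discussion, assuming a 27-bit LSB
    field: masked subtraction on the reader side recovers the delta, with
    the same overflow-detection power as a stored delta (a gap of 2^27
    cycles or more needs an extended timestamp either way):

        #define TSC_LSB_BITS    27
        #define TSC_LSB_MASK    ((1U << TSC_LSB_BITS) - 1)

        /*
         * Reader side: reconstruct the time delta between two consecutive
         * events from their raw TSC LSBs. Correct as long as the writer
         * emitted an extended timestamp whenever the real delta did not
         * fit in 27 bits.
         */
        static u32 tsc_delta(u32 prev_lsb, u32 cur_lsb)
        {
                return (cur_lsb - prev_lsb) & TSC_LSB_MASK;
        }
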
    > > > Now for the buffer pages implementation :
    > > >
    > > >
    > > > +struct buffer_page {
    > > > + union {
    > > > + struct {
    > > > + unsigned long flags; /* mandatory */
    > > > + atomic_t _count; /* mandatory */
    > > > + u64 time_stamp; /* page time stamp */
    > > >
    > > > Why should we ever associate a time stamp with a page ??

    > >
    > > Discarding the whole sub-buffer idea, it could be used to validate
    > > whichever time-stamp logic.
    > >
    > > Personally I don't particularly like the sub-buffer concept, and I don't
    > > think we need it.
    > >

    >
    > Depends on the use-cases I guess, and on the locking used. My question
    > about this time-stamp is : shouldn't we, instead, put it in a page
    > header (exposed to userspace) rather than in this page frame, which I
    > believe is in a separate data structure (and not in the page per se) ?


    I'd rather keep the page headers out of the buffer. Let the tracer
    expose it.

    >
    > > > I see that we could save the timestamp at which a subbuffer switch
    > > > happens (which in this patchset's semantics happens to be a page), but why
    > > > would we ever want to save that in the page frame? Especially if we
    > > > simply write the LSBs instead of a time delta... Also, I would write
    > > > this timestamp in a subbuffer _header_ which is exported to userspace,

    > >
    > > Why have sub-buffers at all?
    > >

    >
    > To support events larger than 4kB. Also, to support different backing
    > storage for the buffers without keeping a limitation specific to a
    > single page-tied implementation.


    Create something special for >page_size events. Those are not the norm
    anyway.

    >
    > > > but I clearly don't see why we keep it here. In addition, it's
    > > > completely wrong in a layered approach: if we want to switch from pages
    > > > to video memory or to a statically allocated buffer at boot time, such
    > > > "page-related" timestamp hacks won't do.

    > >
    > > I don't think you _ever_ want to insert actual video memory in the
    > > trace; what you _might_ possibly want to do is insert a copy of a
    > > frame, but that you can do with paged buffers like we have: just add an
    > > entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
    > > to copy everything to page-aligned chunks.
    > >

    >
    > No, no. What I meant is: some people would like to _reserve_ part of
    > their video card memory so they can use it as backing storage for the
    > _buffers_. The benefit of doing that is that those buffers will survive
    > a hot reboot. Rather useful to debug kernel crashes.


    This sounds like something special, and can be done with something
    different. No, I don't want the ring buffer infrastructure to be a
    complex mechanism that can do everything. It should be a generic
    infrastructure that does what 99% of the tracers want to do. And 100%
    of those tracers want to record something smaller than a page size.

    >
    > Copying larger event payloads would actually require reserving multiple
    > events in the trace, and would require a cookie or some locking to tell
    > that a given event is actually the follow-up of the previous one,
    > because with more evolved locking schemes, each event space reservation
    > is atomic and we can therefore not assume that a single payload split
    > across two events will be placed contiguously in the trace buffers (it
    > can be interrupted, preempted...).


    Again, make a special tracer for this. This is not the common case, and
    I really don't care about it.

    >
    > > There is nothing technically prohibiting this in Steve's scheme, except
    > > for Linus telling you you're an idiot for doing it.
    > >
    > > The NOP packets you dislike allow us to align regular entries with page
    > > boundaries, which in turn allows this 0-copy stuff. If you use the
    > > sub-write iterator stuff we talked about a while ago, you can leave out
    > > the NOPs.
    > >
    > > Having both makes sense (if you want the large packets stuff): use the
    > > regular page-aligned 0-copy stuff for regular small packets, and use the
    > > sub-write iterator stuff for your huge packets.
    > >

    >
    > Sorry, I don't understand what the NOP packet gives you that the used
    > bytes amount in the subbuffer (page) header doesn't provide (given my
    > explanation above).
    >
    >
    > > > > As for the page-spanning entries, I think we can do that with Steve's
    > > > > system just fine; it's just that Linus said it's a dumb idea and Steve
    > > > > dropped it, but there is nothing fundamental stopping us from recording
    > > > > a length that is > PAGE_SIZE and copying data into the pages one at a
    > > > > time.
    > > > >
    > > > > Nor do I see it impossible to implement splice on top of Steve's
    > > > > ring-buffer..
    > > > >
    > > > > So again, why?
    > > > >
    > > >
    > > > I'd like to separate the layer which deals with data read/write from the
    > > > layer which deals with synchronization of data write/read to/from the
    > > > buffers so we can eventually switch to a locking mechanism which
    > > > provides a sane performance level, and given Steven's linked list
    > > > implementation, it will just add unneeded locking requirements.

    > >
    > > How does it differ from your linked list implementation? Realistically,
    > > we want but a single locking scheme for the trace buffer stuff. So
    > > factoring it out doesn't really make sense.
    > >

    >
    > My linked list implementation does not assume we have to do a "disable
    > interrupts" around the whole pointer manipulation, simply because there
    > isn't any complicated manipulation done. Actually, the only linked list
    > pointer manipulation I need to do is to atomically save one of three
    > pointers to cache the current page I am writing to/reading from/getting
    > the header pointer from.
    >
    > > > Compared to this, I deal with buffer coherency with two 32-bit counters
    > > > in LTTng: a write offset and a consumer offset (plus a per-subbuffer
    > > > commit count).

    > >
    > > I think that can work just fine on top of Steve's stuff too; it needs a
    > > bit of trimming etc., but AFAICT there isn't anything stopping us from
    > > implementing his reserve function as light as you want:
    > >
    > > int reserve(buffer, size, flags)
    > > {
    > >         preempt_disable();
    > >         cpu = smp_processor_id();
    > >         cpu_buf = buffer->buffers[cpu];
    > >
    > >         if (cpu_buf->stop) {
    > >                 ret = -EBUSY;
    > >                 goto out;
    > >         }
    > >
    > > again:
    > >         pos = cpu_buf->write_pos;
    > >         if (flags & CONTIG) {
    > >                 new_pos = pos + size;
    > >         } else {
    > >                 if (size > PAGE_SIZE) {
    > >                         ret = -ENOSPC;
    > >                         goto out;
    > >                 }
    > >                 if (((pos + size) & PAGE_MASK) != (pos & PAGE_MASK))
    > >                         insert_nops();
    > >                 new_pos = pos + size;
    > >         }
    > >         if (cmpxchg(&cpu_buf->write_pos, pos, new_pos) != pos)

    >
    > How do you deal with his call to __rb_reserve_next, which deals with
    > page change and plays with the tail_page pointer? This is where the
    > whole problem lies; you have to atomically update more than a single
    > atomic variable.



    We only need to handle writes in stack order (interrupts and NMIs). I'll
    move the tail index into the page itself, and then we can do atomic
    operations on both bumping the tail index and moving the tail page
    pointer. This will work nicely.
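
    One illustrative way to pair the two updates (not necessarily what v2
    will do): keep the tail index in the page frame so a single cmpxchg
    both claims space and detects page overflow, even against nested
    interrupt/NMI writers:

        struct buffer_page_frame {
                unsigned long tail;     /* write offset within this page */
        };

        /*
         * Claim 'size' bytes in the current tail page with one cmpxchg.
         * A nested writer (irq, NMI) that sneaks in either completes
         * before we retry, or we lose the race and simply loop again.
         * Returns the claimed offset, or -1 if the page must be flipped.
         */
        static long claim_space(struct buffer_page_frame *page,
                                unsigned long size)
        {
                unsigned long tail, new_tail;

                do {
                        tail = page->tail;
                        new_tail = tail + size;
                        if (new_tail > PAGE_SIZE)
                                return -1;
                } while (cmpxchg(&page->tail, tail, new_tail) != tail);

                return tail;
        }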


    >
    > >                 goto again;
    > >         return pos;
    > >
    > > out:
    > >         preempt_enable();
    > >         return ret;
    > > }
    > >
    > > If you put the offset & PAGE_MASK in each page-frame you can use that to
    > > easily detect when you need to flip to the next page.
    > >

    >
    > Exactly what I have in this patch.
    >
    > > Which I imagine is similar to what you have... although I must admit to
    > > not being sure how to deal with overwrite here, I guess your buffer
    > > must be large enough to ensure no nesting depth allows you to
    > > wrap-around while having an event un-committed.
    > >

    >
    > I deal with overwrite by looking at the commit count value of the
    > subbuffer (page) I am about to start writing into. If it's not a


    Heh, I was thinking of doing the same thing in the page frame. Sounds
    like this is growing into what you did. See, I told you it would.
    Just wait for v2 ;-)

    > multiple of the subbuffer (page) size, then I consider the content of
    > that specific subbuffer corrupted. And yes, it implies that nesting
    > that wraps around the number of subbuffers would cause (detected) data
    > corruption. If that ever happens, the corrupted-subbuffers count is
    > incremented and the given subbuffer is overwritten (the subbuffer
    > commit count is re-equilibrated at that moment). When the stale writer
    > resumes execution, it will increment the commit count and therefore
    > cause a second subbuffer corruption. In the end, 2 subbuffers are
    > dropped if this kind of situation happens. But I guess the rule is to
    > make sure nested writes do not overflow the complete buffer size, or
    > simply to make the buffers large enough.


    Actually, I plan on pushing all the work to the reader. The writer will
    run at full speed. I'm not going to worry about it now; I'm pretty sure
    we have the infrastructure in place to make it happen. Another reason I
    don't want the ring buffer to interface with userland directly is that
    it lets us change it without worrying.

    Remember, there is no internal kernel backward compatibility ;-)

    -- Steve


  13. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    > A 4 byte dataless payload is useless anyway.

    Not at all convinced that's true - we used it for lots of things.
    Start and end of irq event is one frequent example.

    Also, in a compact mode, we can record start and end of
    syscalls like this (without parameters).

  14. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    On Tue, Sep 30, 2008 at 10:22 AM, Martin Bligh wrote:
    >> A 4 byte dataless payload is useless anyway.

    >
    > Not at all convinced that's true - we used it for lots of things.
    > Start and end of irq event is one frequent example.
    >
    > Also, in a compact mode, we can record start and end of
    > syscalls like this (without parameters).


    Ah, sorry, maybe I misread this. No data with no event IDs, etc., is
    fairly useless. 4 bytes of data including space to record event IDs is OK.

  15. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Martin Bligh (mbligh@google.com) wrote:
    > On Tue, Sep 30, 2008 at 10:22 AM, Martin Bligh wrote:
    > >> A 4 byte dataless payload is useless anyway.

    > >
    > > Not at all convinced that's true - we used it for lots of things.
    > > Start and end of irq event is one frequent example.
    > >
    > > Also, in a compact mode, we can record start and end of
    > > syscalls like this (without parameters).

    >
    > Ah, sorry, maybe I misread this. No data with no event IDs, etc., is
    > fairly useless. 4 bytes of data including space to record event IDs is OK.


    In Steven's scheme, the event IDs in the 4 bytes are reserved for
    (useless) internal use. They can therefore not be used for specific
    tracer event IDs, which I think is a misuse of the precious bits
    otherwise available to store really useful event IDs.

    Mathieu

    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

  16. Re: [RFC PATCH] LTTng relay buffer allocation, read, write


    On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
    >
    > In Steven's scheme, the event IDs in the 4 bytes are reserved for
    > (useless) internal use. They can therefore not be used for specific
    > tracer event IDs, which I think is a misuse of the precious bits
    > otherwise available to store really useful event IDs.


    I'm using them, so they must not be totally useless. ;-)

    But ftrace has its own event IDs and I don't want the ring buffer to ever
    have to know about them.

    -- Steve


  17. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Steven Rostedt (rostedt@goodmis.org) wrote:
    >
    > On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
    > >
    > > In Steven's scheme, the event IDs in the 4 bytes are reserved for
    > > (useless) internal use. They can therefore not be used for specific
    > > tracer event IDs, which I think is a misuse of the precious bits
    > > otherwise available to store really useful event IDs.

    >
    > I'm using them, so they must not be totally useless. ;-)
    >
    > But ftrace has its own event ids and I don't want the ring buffer to ever
    > have to know about them.
    >
    > -- Steve
    >


    You are actually using them to put redundant information that could be
    encoded differently and thus save 4 bits per event record, more or less
    what will be needed by most tracers (15 IDs, 1 reserved for an extended
    ID field).

    So the fact that you use them does not mean they are really required,
    and I don't think such duplicated information actually makes things more
    solid. Maybe just more obscure ?

    Mathieu

    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

  18. Re: [RFC PATCH] LTTng relay buffer allocation, read, write


    On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
    >
    > You are actually using them to put redundant information that could be
    > encoded differently and thus save 4 bits per event record, more or less
    > what will be needed by most tracers (15 IDs, 1 reserved for an extended
    > ID field).


    I really like the idea of keeping the tracer event IDs out of the ring
    buffer logic.

    >
    > So the fact that you use them does not mean they are really required,
    > and I don't think such duplicated information actually makes things more
    > solid. Maybe just more obscure ?


    Well, at least for version 1 these bits stay. I already found two bugs
    by hitting the event padding when the size said I should not have. This
    redundant information makes me feel a bit more cozy when they match.

    -- Steve


  19. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    > You are actually using them to put redundant information that could be
    > encoded differently and thus save 4 bits per event record, more or less
    > what will be needed by most tracers (15 IDs, 1 reserved for an extended
    > ID field).


    You have 15 event types that are useful with no data payload at all?

    > So the fact that you use them does not mean they are really required,
    > and I don't think such duplicated information actually makes things more
    > solid. Maybe just more obscure ?


    This is all over 1 bit of information, right? Since you need at least 1 for
    the timestamp stuff.

  20. Re: [RFC PATCH] LTTng relay buffer allocation, read, write

    * Martin Bligh (mbligh@google.com) wrote:
    > > You are actually using them to put redundant information that could be
    > > encoded differently and thus save 4 bits per event record, more or less
    > > what will be needed by most tracers (15 IDs, 1 reserved for an extended
    > > ID field).

    >
    > You have 15 event types that are useful with no data payload at all?
    >


    I am not saying anything about the actual number of events with 0-byte
    payloads I actually have in my own instrumentation, if this is what you
    mean. I am just saying that it leaves this room available for such
    events.

    Even if there is a 32-bit payload associated with those events, the
    fact that we can encode the event ID in the 32-bit header will bring
    those events from 96 bits (due to 32-bit alignment) down to 64 bits.
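
    Spelling out the arithmetic with illustrative layouts (32-bit
    alignment assumed):

        /* ID escaped outside the header: 3 x 32 bits = 96 bits/event. */
        struct event_ext_id {
                u32 header;     /* time_ext:1, evid:4 (= 0xF), time_lsb:27 */
                u32 ext_id;     /* full 32-bit event ID */
                u32 payload;
        };

        /* ID folded into the 4-bit header field: 64 bits/event. */
        struct event_compact {
                u32 header;     /* time_ext:1, evid:4, time_lsb:27 */
                u32 payload;
        };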

    > > So the fact that you use them does not mean they are really required,
    > > and I don't think such duplicated information actually makes things more
    > > solid. Maybe just more obscure ?

    >
    > This is all over 1 bit of information, right? Since you need at least 1 for
    > the timestamp stuff.


    4 bits of information could be added to the 32-bit header if we allow
    tracers to register their first 15 event IDs in those 4 bits.

    But well... let's keep that for v2.

    Mathieu

    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
