
Thread: [patch 1/4] x86: remove bad comment

  1. [patch 1/4] x86: remove bad comment

    Remove a comment that is no longer correct. The reconfiguration is done directly in __switch_to_xtra.

    Signed-off-by: Markus Metzger
    ---

    Index: linux-2.6-x86/arch/x86/kernel/process_64.c
    ===================================================================
    --- linux-2.6-x86.orig/arch/x86/kernel/process_64.c 2007-12-10 09:35:26.%N +0100
    +++ linux-2.6-x86/arch/x86/kernel/process_64.c 2007-12-10 09:36:38.%N +0100
    @@ -598,10 +598,6 @@
    memset(tss->io_bitmap, 0xff, prev->io_bitmap_max);
    }

    -	/*
    -	 * Last branch recording recofiguration of trace hardware and
    -	 * disentangling of trace data per task.
    -	 */
    	if (test_tsk_thread_flag(prev_p, TIF_BTS_TRACE_TS))
    		ptrace_bts_take_timestamp(prev_p, BTS_TASK_DEPARTS);

    ---------------------------------------------------------------------
    Intel GmbH
    Dornacher Strasse 1
    85622 Feldkirchen/Muenchen Germany
    Sitz der Gesellschaft: Feldkirchen bei Muenchen
    Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
    Registergericht: Muenchen HRB 47456 Ust.-IdNr.
    VAT Registration No.: DE129385895
    Citibank Frankfurt (BLZ 502 109 00) 600119052

    This e-mail and any attachments may contain confidential material for
    the sole use of the intended recipient(s). Any review or distribution
    by others is strictly prohibited. If you are not the intended
    recipient, please contact the sender and delete all copies.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: x86, ptrace: support for branch trace store(BTS)


    [ Cc:-ed Alan Stern, of kwatch fame - Alan might be interested in this
    too. ]

    hi Markus,

    finally had time to take a closer look at the design of your
    Branch-Trace-Support-via-ptrace-on-x86 patchset. I think we'll need to
    work some more on general API issues.

    here's the current proposed API:

    + case PTRACE_BTS_MAX_BUFFER_SIZE:
    + case PTRACE_BTS_ALLOCATE_BUFFER:
    + case PTRACE_BTS_GET_BUFFER_SIZE:
    + case PTRACE_BTS_READ_RECORD:
    + case PTRACE_BTS_CONFIG:
    + case PTRACE_BTS_STATUS:

    i can see a couple of open questions:

    1) PTRACE_BTS_MAX_BUFFER_SIZE

    why is a trace buffer size limit visible to user-space? It's 4000
    entries right now:

    #define PTRACE_BTS_BUFFER_MAX 4000

    it would be more flexible if user-space could offer arbitrary sized
    buffers, which the kernel would attempt to mlock(). Then those pages
    could be fed to the BTS MSR. I.e. there should be no hard limit in the
    API, other than a natural resource limit.

    2) struct bts_struct

    the structure of it is hardwired:

    the basic unit is an array of bts_struct:

    +struct bts_struct {
    +	enum bts_qualifier qualifier;
    +	union {
    +		/* BTS_BRANCH */
    +		struct {
    +			long from_ip;
    +			long to_ip;
    +		} lbr;
    +		/* BTS_TASK_ARRIVES or
    +		   BTS_TASK_DEPARTS */
    +		unsigned long long timestamp;
    +	} variant;
    +};

    while other CPUs (on other architectures, etc.) might have a different
    raw format for the trace entries. So it would be better to implement BTS
    support in kernel/ptrace.c with a general trace format, and to provide a
    cross-arch API (only implemented by arch/x86 in the beginning) to
    translate the 'raw' trace entries into the general format.
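    The translation layer Ingo sketches here could be as small as a per-arch
    hook that converts one raw record into the generic entry. A minimal
    user-space sketch (all names are hypothetical, not the posted API):

```c
#include <assert.h>
#include <stdint.h>

/* Generic trace entry as kernel/ptrace.c might define it
 * (hypothetical name and layout). */
struct trace_entry {
	uint64_t from_ip;
	uint64_t to_ip;
};

/* Raw Core2-style BTS record: from, to, one reserved word. */
struct raw_bts_core2 {
	uint64_t from;
	uint64_t to;
	uint64_t reserved;
};

/* Per-arch hook: translate one raw record into the generic format. */
static void bts_translate_core2(const struct raw_bts_core2 *raw,
				struct trace_entry *out)
{
	out->from_ip = raw->from;
	out->to_ip = raw->to;
}
```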

    3)

    it would certainly be useful for some applications to have a large
    "virtual" trace buffer and a small "real" trace buffer. The kernel would
    use BTS high-watermark interrupts to feed the real trace buffer into the
    larger "virtual" trace buffer. This way we wouldnt even have to use
    mlock() to pin down the trace buffer.

    Ingo

  3. RE: x86, ptrace: support for branch trace store(BTS)

    >-----Original Message-----
    >From: Ingo Molnar [mailto:mingo@elte.hu]
    >Sent: Montag, 10. Dezember 2007 21:21


    >here's the current proposed API:
    >
    >+ case PTRACE_BTS_MAX_BUFFER_SIZE:
    >+ case PTRACE_BTS_ALLOCATE_BUFFER:
    >+ case PTRACE_BTS_GET_BUFFER_SIZE:
    >+ case PTRACE_BTS_READ_RECORD:
    >+ case PTRACE_BTS_CONFIG:
    >+ case PTRACE_BTS_STATUS:
    >
    >i can see a couple of open questions:
    >
    >1) PTRACE_BTS_MAX_BUFFER_SIZE
    >
    >why is a trace buffer size limit visible to user-space? It's 4000
    >entries right now:
    >
    > #define PTRACE_BTS_BUFFER_MAX 4000
    >
    >it would be more flexible if user-space could offer arbitrary sized
    >buffers, which the kernel would attempt to mlock(). Then those pages
    >could be fed to the BTS MSR. I.e. there should be no hard limit in the
    >API, other than a natural resource limit.



    That would be a variation on Andi's zero-copy proposal, wouldn't it?

    The user supplies the BTS buffer and the kernel manages DS.

    Andi further suggested a vDSO to interpret the data and translate the
    hardware format into a higher level user format.

    I take it that you would leave that inside ptrace.

    I need to look more into mlock. So far, I found a system call in
    /usr/include/sys/mman.h and two functions sys_mlock() and
    user_shm_lock() in the kernel. Is there a memory expert around who could
    point me to some interesting places to look at?

    Can we distinguish kernel-locked memory from user-locked memory?
    I could imagine a malicious user munlock()ing the buffer he provided to
    ptrace.

    Is there a real difference between mlock()ing user memory and allocating
    kernel memory? There would be if we could page out mlock()ed memory when
    the user thread is not running. We would need to disable DS before
    paging out, and page in before enabling it. If we cannot, then kernel
    allocated memory would require less space in physical memory.



    >2) struct bts_struct
    >
    >the structure of it is hardwired:
    >
    >the basic unit is an array of bts_struct:
    >
    > +struct bts_struct {
    > +	enum bts_qualifier qualifier;
    > +	union {
    > +		/* BTS_BRANCH */
    > +		struct {
    > +			long from_ip;
    > +			long to_ip;
    > +		} lbr;
    > +		/* BTS_TASK_ARRIVES or
    > +		   BTS_TASK_DEPARTS */
    > +		unsigned long long timestamp;
    > +	} variant;
    > +};
    >
    >while other CPUs (on other architectures, etc.) might have a different
    >raw format for the trace entries. So it would be better to
    >implement BTS
    >support in kernel/ptrace.c with a general trace format, and to
    >provide a
    >cross-arch API (only implemented by arch/x86 in the beginning) to
    >translate the 'raw' trace entries into the general format.



    The bts_struct is already an abstraction from the raw hardware format.
    The hardware uses:

    struct bts_netburst {
    	void *from;
    	void *to;
    	long           : 4;
    	long predicted : 1;
    };
    struct bts_core2 {
    	u64 from;
    	u64 to;
    	u64 : 64;
    };

    I encode timestamps in the raw hardware format. This puts the timestamps
    at the correct places and frees me from any efforts to correlate
    separate timestamp data and branch record data (which is getting ugly
    when the BTS buffer overflows).

    Now, arch/x86/kernel/ds.c already needs to abstract from the raw
    hardware format. I chose to add timestamps to that abstraction. This
    makes the DS interface general enough to be directly used by ptrace
    (causing some misunderstanding, I'm afraid).

    If a new architecture eliminated the extra word, I would have to merge
    the qualifier and the escape address, i.e. have some (probably the
    least significant) bits of the escape address encode the qualifier. The
    interface would remain stable.

    Other architectures would at least need to encode the source and
    destination address. Thus, we should have enough room to support
    timestamps for other architectures, as well. If they choose a more
    compact encoding, the interface might no longer be usable.
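    Such a merged encoding could look like the following sketch. It assumes
    instruction addresses are at least 4-byte aligned, so the two least
    significant bits are free to carry the qualifier; the names are mine,
    not from the patches:

```c
#include <assert.h>
#include <stdint.h>

#define BTS_QUAL_MASK 0x3ull	/* low two bits carry the qualifier */

/* Pack a qualifier into the low bits of the escape address. */
static uint64_t bts_encode(uint64_t escape_addr, unsigned int qualifier)
{
	return (escape_addr & ~BTS_QUAL_MASK) | (qualifier & BTS_QUAL_MASK);
}

static unsigned int bts_qualifier(uint64_t word)
{
	return (unsigned int)(word & BTS_QUAL_MASK);
}

static uint64_t bts_address(uint64_t word)
{
	return word & ~BTS_QUAL_MASK;
}
```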


    I could instead use

    	struct ds_bts {
    		void *from;
    		void *to;
    	};

    for the DS interface and move timestamps and their encoding in the DS
    BTS format to ptrace.
    This would move the DS interface closer to the hardware (yet still
    abstract from details) and move anything extra to its users - currently
    only ptrace.

    On the down side, other users of DS (kernel tracing, for example) would
    have to invent their own method to encode additional information into
    the BTS buffer.

    I saw benefit in making the DS interface more powerful.


    >3)
    >
    >it would certainly be useful for some applications to have a large
    >"virtual" trace buffer and a small "real" trace buffer. The
    >kernel would
    >use BTS high-watermark interrupts to feed the real trace
    >buffer into the
    >larger "virtual" trace buffer. This way we wouldnt even have to use
    >mlock() to pin down the trace buffer.


    I see two ways this could be done wrt disentangling trace:
    1. keep a small per-thread buffer
    2. use a per-cpu buffer

    The first alternative would allow us to disentangle the per-thread trace
    easily by simply reconfiguring the DS on a context switch (similar to
    how it is currently done).

    The second alternative would require us to memcpy() the BTS buffer at
    each context switch. This would be a huge overhead at a performance
    critical place.

    I would therefore keep the per-thread buffer. This costs one mlock()ed
    or kmalloc()ed page per traced thread.

    In addition to that, we need the (pageable) memory to hold the virtual
    buffer.
    Thus, it only pays off if the virtual buffer is significantly bigger
    than the real buffer (we're paying one page extra).


    Let's look at the interrupt routine.
    It would need to disable tracing, page in the memory, memcpy() the BTS
    buffer, and reenable tracing. This is again a big overhead and might
    take some time.

    AFAIK, the interrupt routine would schedule the work and return
    immediately. A kernel thread would later pick up the work and do the
    copying.

    The traced thread would need to be suspended until the copying is done
    (unless we have a table of buffers that we cycle through and thus can
    provide a fresh buffer to the thread immediately).

    Thus, its timeslice would effectively be reduced to the amount of work
    it gets done in a single 'real' BTS buffer.

    I am not familiar with the new scheduler and how this would impact
    (system) performance. I could imagine that either the traced task
    virtually starves or that it receives an unfairly big amount of system
    resources. But then, I'm only guessing. I'm sure you can give a precise
    answer.


    Overall, this looks like a lot more effort and a lot more things that
    could go wrong and bring down the entire system, compared to an almost
    trivial and reasonably safe implementation.

    And it only starts paying off when the virtual buffer is getting
    reasonably big; up to a few pages, we should be better off simply
    mlock()ing the entire buffer.

    For use in debuggers to show the debuggee's execution trace to the user,
    I think that this is unnecessarily complicated. For all the use cases
    that I can think of today, a single page of memory should suffice.

    From my own experience (debugging compilers on XScale), by far the most
    interesting information is the last handful of branch records. I cannot
    say how far users would want to look back in the trace history or how
    far they can look back and still make some sense out of what they see.

    Besides, all this would be hidden behind the ptrace interface. We might
    add this functionality when there is demand.


    thanks and regards,
    markus.

  4. Re: x86, ptrace: support for branch trace store(BTS)


    * Metzger, Markus T wrote:

    > That would be a variation on Andi's zero-copy proposal, wouldn't it?
    >
    > The user supplies the BTS buffer and the kernel manages DS.
    >
    > Andi further suggested a vDSO to interpret the data and translate the
    > hardware format into a higher level user format.
    >
    > I take it that you would leave that inside ptrace.


    yeah - i think both zero-copy and vdso are probably overkill for this.

    On the highest level, there are two main usecases of BTS that i can
    think of: debugging [a user-space task crashes and developer would like
    to see the last few branches taken - possibly extended to kernel space
    crashes as well], and instrumentation.

    In the first use-case (debugging) zero-copy is just an unnecessary
    complication.

    In the second use-case (tracing, profiling, call coverage metrics), we
    could live without zero-copy, as long as the buffer could be made "large
    enough". The current 4000 records limit seems rather low (and arbitrary)
    and probably makes the mechanism unsuitable for say call coverage
    profiling purposes. There's also no real mechanism that i can see to
    create a guaranteed flow of this information between the debugger and
    debuggee (unless i missed something), the code appears to overflow the
    array, and destroy earlier entries, right? That's "by design" for
    debugging, but quite a limitation for instrumentation which might want
    to have a reliable stream of the data (and would like the originating
    task to block until the debugger had an opportunity to siphon out the
    data).

    > I need to look more into mlock. So far, I found a system call in
    > /usr/include/sys/mman.h and two functions sys_mlock() and
    > user_shm_lock() in the kernel. Is there a memory expert around who
    > could point me to some interesting places to look at?


    sys_mlock() is what i meant - you could just call it internally from
    ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all
    the "there's too much memory pinned down" details out of the ptrace
    code.

    > Can we distinguish kernel-locked memory from user-locked memory? I
    > could imagine a malicious user munlock()ing the buffer he provided to
    > ptrace.


    yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That
    gives a permanent reference count to those pages.

    > Is there a real difference between mlock()ing user memory and
    > allocating kernel memory? There would be if we could page out
    > mlock()ed memory when the user thread is not running. We would need to
    > disable DS before paging out, and page in before enabling it. If we
    > cannot, then kernel allocated memory would require less space in
    > physical memory.


    mlock() would in essence just give you an easy "does this user have
    enough privilege to lock this many pages" API. The real pinning would be
    done by get_user_pages(). Once you have those pages, they wont be
    swapped out.

    Ingo

  5. RE: x86, ptrace: support for branch trace store(BTS)

    >-----Original Message-----
    >From: Ingo Molnar [mailto:mingo@elte.hu]
    >Sent: Dienstag, 11. Dezember 2007 15:53



    >In the second use-case (tracing, profiling, call coverage metrics), we
    >could live without zero-copy, as long as the buffer could be
    >made "large
    >enough". The current 4000 records limit seems rather low (and
    >arbitrary)
    >and probably makes the mechanism unsuitable for say call coverage
    >profiling purposes.


    The limit is indeed arbitrary. It was requested by one of my internal
    reviewers. The rationale was that we do not want to allow ptrace users
    to request that an unlimited amount of memory be allocated in the
    kernel.

    Andi suggested making this a sysctl.

    Would it be safe to drop the artificial limit and let the limit be the
    available memory?


    >There's also no real mechanism that i can see to
    >create a guaranteed flow of this information between the debugger and
    >debuggee (unless i missed something), the code appears to overflow the
    >array, and destroy earlier entries, right? That's "by design" for
    >debugging, but quite a limitation for instrumentation which might want
    >to have a reliable stream of the data (and would like the originating
    >task to block until the debugger had an opportunity to siphon out the
    >data).


    That's correct as well. My focus is on debugging. And that's actually
    the most useful behavior in that case. I'm not sure what you mean with
    'instrumentation'.


    Regarding streaming data (the only real use case I can think of would be
    path profiling), I think I would rather allocate a bigger buffer and
    send a signal when the buffer gets full instead of having the kernel
    provide an interrupt service routine and copy the data from a smaller
    buffer into a bigger one.

    This would leave it to the tracing task which likely wants to write the
    data into a file if it really got too big to hold it in physical memory.
    We would need another command to reset the DS index, and probably one to
    provide the entire buffer at once, and add some configuration bits to
    say whether we want the signal or a cyclic buffer.
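    The configuration bits mentioned above might be laid out like this (a
    sketch with made-up names; the posted patches define no such flags):

```c
#include <assert.h>

/* Hypothetical overflow-policy bits for an extended CONFIG command. */
#define BTS_OVF_SIGNAL 0x1u	/* send a signal when the buffer fills */
#define BTS_OVF_CYCLIC 0x2u	/* wrap around, overwriting old records */

/* Exactly one overflow policy must be selected. */
static int bts_overflow_valid(unsigned int flags)
{
	unsigned int policy = flags & (BTS_OVF_SIGNAL | BTS_OVF_CYCLIC);

	return policy == BTS_OVF_SIGNAL || policy == BTS_OVF_CYCLIC;
}
```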


    For debugging, I think this extension is not relevant. Since it could be
    implemented as an extension of the current interface, I would leave it,
    for now.


    >> I need to look more into mlock. So far, I found a system call in
    >> /usr/include/sys/mman.h and two functions sys_mlock() and
    >> user_shm_lock() in the kernel. Is there a memory expert around who
    >> could point me to some interesting places to look at?

    >
    >sys_mlock() is what i meant - you could just call it internally from
    >ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all
    >the "there's too much memory pinned down" details out of the ptrace
    >code.
    >
    >> Can we distinguish kernel-locked memory from user-locked memory? I
    >> could imagine a malicious user munlock()ing the buffer he provided to
    >> ptrace.

    >
    >yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That
    >gives a permanent reference count to those pages.


    sys_mlock() eventually results in a call to get_user_pages().
    I have not found the corresponding put_page() call for munlock(),
    though.

    So you're suggesting an additional get_user_pages() to prevent the user
    from munlock()ing the memory?
    That is, even if the pages are munlock()ed, they will still be pinned
    down and cannot be unlocked any further?

    Something like:

    	ptrace_allocate(struct task_struct *child, void *base, size_t size)
    	{
    		if (sys_mlock(base, size) < 0)
    			return -EPERM;
    		get_user_pages(child, ...., base, size, ....);
    	}


    >> Is there a real difference between mlock()ing user memory and
    >> allocating kernel memory? There would be if we could page out
    >> mlock()ed memory when the user thread is not running. We would need to
    >> disable DS before paging out, and page in before enabling it. If we
    >> cannot, then kernel allocated memory would require less space in
    >> physical memory.

    >
    >mlock() would in essence just give you an easy "does this user have
    >enough privilege to lock this many pages" API. The real
    >pinning would be
    >done by get_user_pages(). Once you have those pages, they wont be
    >swapped out.


    The actual physical memory consumption will be worse (or at best equal)
    compared to kmalloc()ed memory, since we need to pin down entire pages,
    whereas kmalloc() would allocate just the memory that is actually needed.

    Unless we would put_page() and get_user_pages() the pinned down pages on
    a context switch. But then, get_user_pages() calls cond_resched(), which
    might not be a good thing to do during a context switch.


    Overall, I think that kmalloc() is actually more memory efficient than
    mlock(). The only benefit of mlock() would be that we save some
    copy_to_user() in the debugger task. This should not be a big deal for
    debuggers, but it might become interesting for profilers.

    On the other hand, profilers would typically expect a stream of data and
    they would either want a terribly big buffer or store the data into a
    file. In both cases, they need to do a lot of copying, already.

    I would stay with kmalloc() for both use cases.


    Regarding your other open, are you OK with the DS interface?


    Andi suggested in one of his first reviews that the configuration
    commands should be combined.
    Something like:

    	struct bts_config {
    		size_t buffer_size;
    		unsigned int flags;
    	};
    The buffer would be reallocated, if its size changed. The flags would be
    treated like the current CONFIG command to control last branch recording
    and timestamp recording. They would eventually be extended to control
    overflow behavior (signal vs. cyclic buffer).

    I would expect that the buffer_size does not change too often and that
    the flags might change more frequently. Further, this interface would be
    harder to extend should the struct ever need new fields.
    On the up side, it rids us of two commands.
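    One common way to keep such a combined command extensible is to let the
    caller pass the struct size it was compiled against, so fields can be
    appended later without breaking old binaries. A sketch under that
    assumption (the size field is my addition, not Andi's proposal):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct bts_config {
	size_t size;		/* sizeof(struct bts_config) seen by the caller */
	size_t buffer_size;	/* reallocate the buffer if this changes */
	unsigned int flags;	/* last-branch / timestamp / overflow bits */
};

/* Import a possibly shorter (older) caller struct into the current
 * layout, zero-filling fields the caller did not know about. */
static void bts_config_import(struct bts_config *dst,
			      const void *src, size_t src_size)
{
	memset(dst, 0, sizeof(*dst));
	if (src_size > sizeof(*dst))
		src_size = sizeof(*dst);
	memcpy(dst, src, src_size);
}
```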


    thanks and regards,
    markus.

  6. Re: x86, ptrace: support for branch trace store(BTS)


    * Metzger, Markus T wrote:

    > Andi suggested to make this a sysctl.


    that's just as arbitrary ...

    > Would it be safe to drop the artificial limit and let the limit be the
    > available memory?


    no, that would be a DoS :-/

    mlock() is rlimit controlled and is available to unprivileged users - up
    to a small amount of memory can be locked down. But i agree that mlock()
    can be problematic - see below.

    > > There's also no real mechanism that i can see to create a guaranteed
    > > flow of this information between the debugger and debuggee (unless i
    > > missed something), the code appears to overflow the array, and
    > > destroy earlier entries, right? That's "by design" for debugging,
    > > but quite a limitation for instrumentation which might want to have
    > > a reliable stream of the data (and would like the originating task
    > > to block until the debugger had an opportunity to siphon out the
    > > data).

    >
    > That's correct as well. My focus is on debugging. And that's actually
    > the most useful behavior in that case. I'm not sure what you mean with
    > 'instrumentation'.


    the branch trace can be used to generate a very finegrained
    profile/histogram of code execution - even of rarely executed codepaths
    which cannot be captured via timer/perf-counter based profiling.

    another potential use would be for call graph coverage testing. (which
    currently requires compiler-inserted calls - would be much nicer if we
    could do this via the hardware.)

    etc. Branch tracing isnt just for debugging i think - as long as the
    framework is flexible enough.

    > The actual physical memory consumption will be worse (or at best
    > equal) compared to kmalloc()ed memory, since we need to pin down entire
    > pages, whereas kmalloc() would allocate just the memory that is
    > actually needed.


    i agree that mlock() has problems. A different model would be: no mlock
    and no get_user_pages() - something quite close to what you have
    already. Data is streamed out of the internal (hardware-filled,
    kmalloc()-ed, user-inaccessible) buffer, we stop task execution until it
    is copied to the larger, user-provided buffer. The debugging feature you
    are interested in could be enabled as a special-case of this mechanism:
    if the user-space buffer is not larger than the hardware buffer then no
    streaming is needed, you can just capture into the kernel buffer.
    User-space would have to do a PTRACE_BTS_DRAIN_BUFFER call (or something
    like that) to get the "final portion" of the branch trace out into the
    user-space buffer. [which, in your debugging special-case, would be the
    full internal buffer.]

    that way the kmalloc()-ed buffer becomes an internal detail of buffering
    that you rarely have to be aware of. (it could still be queried - like
    your patch does it now.)
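    The drain call described here could reduce to a bounded copy of the
    newest records out of the internal buffer. A flat-memory sketch (names
    and the byte-wise model are assumptions, not the proposed interface):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the final portion of the kernel's internal buffer (kfill bytes
 * filled) into a user-provided buffer of usize bytes; return the
 * number of bytes copied. */
static size_t bts_drain(const unsigned char *kbuf, size_t kfill,
			unsigned char *ubuf, size_t usize)
{
	size_t n = kfill < usize ? kfill : usize;

	/* keep the newest records: the last n bytes written */
	memcpy(ubuf, kbuf + (kfill - n), n);
	return n;
}
```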

    or something else that is intelligent. Basically, what we'd like to have
    is a future-proof, extensible approach that does not necessarily stop at
    debugging and integrates this hardware capability into Linux
    intelligently.

    Ingo

  7. RE: x86, ptrace: support for branch trace store(BTS)

    >-----Original Message-----
    >From: Ingo Molnar [mailto:mingo@elte.hu]
    >Sent: Mittwoch, 12. Dezember 2007 12:04


    >> Would it be safe to drop the artificial limit and let the limit be
    >> the available memory?

    >
    >no, that would be a DoS :-/


    How about:

    for (;;) {
    	int pid = fork();
    	if (pid)
    		ptrace_allocate(pid, ...);
    }


    I guess what we really want is a per-task kmalloc() limit. That limit
    can be quite big.

    Should this rather be within kmalloc() itself (with the data stored in
    task_struct) or should this be local to ptrace_bts_alloc?

    Is there already a per-task memory limit? We could deduct the
    kmalloc()ed memory from that.


    >> the most useful behavior in that case. I'm not sure what you mean
    >> with 'instrumentation'.

    >
    >the branch trace can be used to generate a very finegrained
    >profile/histogram of code execution - even of rarely executed
    >codepaths
    >which cannot be captured via timer/perf-counter based profiling.
    >
    >another potential use would be for call graph coverage testing. (which
    >currently requires compiler-inserted calls - would be much nicer if we
    >could do this via the hardware.)
    >
    >etc. Branch tracing isnt just for debugging i think - as long as the
    >framework is flexible enough.


    Agreed.


    >> The actual physical memory consumption will be worse (or at best
    >> equal) compared to kmalloc()ed memory, since we need to pin down
    >> entire pages, whereas kmalloc() would allocate just the memory that
    >> is actually needed.

    >
    >i agree that mlock() has problems. A different model would be:


    I have my doubts regarding any form of two-buffer approach.

    Let's assume we find a way to get a reasonably big limit.

    Users who want to process that huge amount of data would be better off
    using a file-based approach (well, if it cannot be held in physical
    memory, they will spend most of their time swapping, anyway). Those
    users would typically wait for the 'buffer full' event and drain the
    buffer into a file - whether this is the real buffer or a bigger virtual
    buffer.

    The two-buffer approach would only benefit users who want to hold the
    full profile in memory - or who want to stall the debuggee until they
    processed or somehow compressed the data collected so far.
    Those approaches would not scale for very big profiles.
    The small profile cases would already be covered with a reasonably big
    real buffer.

    The two-buffer approach is moving out the limit of what can be profiled
    in (virtual) memory.


    Independent of how many buffers we have, we need to offer overflow
    handling. This could be:
    1. cyclic buffer
    2. buffer full event

    Robust users would either:
    a. select 2 and ignore the overflow signal to collect an initial profile
    b. select 1 to collect a profile tail
    c. select 2 and drain the real or virtual buffer into a file to collect
    the full profile
    d. select 2 and bail out on overflow (robust but limited usefulness)

    Debuggers would typically choose b. Path profilers would typically
    choose c. Student projects would typically choose d.


    I see benefit in providing a second overflow handling method.
    I only see benefit in a bigger second buffer if we cannot get the limit
    to a reasonable size.

    Am I missing something?


    thanks and regards,
    markus.

  8. Re: x86, ptrace: support for branch trace store(BTS)


    * Metzger, Markus T wrote:

    > Users who want to process that huge amount of data would be better off
    > using a file-based approach (well, if it cannot be held in physical
    > memory, they will spend most of their time swapping, anyway). Those
    > users would typically wait for the 'buffer full' event and drain the
    > buffer into a file - whether this is the real buffer or a bigger
    > virtual buffer.
    >
    > The two-buffer approach would only benefit users who want to hold the
    > full profile in memory - or who want to stall the debuggee until they
    > processed or somehow compressed the data collected so far. Those
    > approaches would not scale for very big profiles. The small profile
    > cases would already be covered with a reasonably big real buffer.


    well, the two-buffer approach would just be a general API with no
    limitations. It would make the internal buffer mostly a pure performance
    detail.

    Ingo

  9. RE: x86, ptrace: support for branch trace store(BTS)

    >-----Original Message-----
    >From: Ingo Molnar [mailto:mingo@elte.hu]
    >Sent: Donnerstag, 13. Dezember 2007 11:30


    >> Users who want to process that huge amount of data would be better off
    >> using a file-based approach (well, if it cannot be held in physical
    >> memory, they will spend most of their time swapping, anyway). Those
    >> users would typically wait for the 'buffer full' event and drain the
    >> buffer into a file - whether this is the real buffer or a bigger
    >> virtual buffer.
    >>
    >> The two-buffer approach would only benefit users who want to hold the
    >> full profile in memory - or who want to stall the debuggee until they
    >> processed or somehow compressed the data collected so far. Those
    >> approaches would not scale for very big profiles. The small profile
    >> cases would already be covered with a reasonably big real buffer.
    >
    >well, the two-buffer approach would just be a general API with no
    >limitations. It would make the internal buffer mostly a pure
    >performance detail.

    Agreed.


    Somewhat.

    A user-provided second buffer would need to be up-to-date when we switch
    to the user's task.
    We would either need to drain the real buffer when switching away from
    the traced task, or drain the real buffers of all traced tasks when
    switching to the tracing task.
    Both would require a get_user_pages() during context switching.

    Alternatively, we could schedule a kernel task to drain the real buffer
    when switching from a traced task.
    The tracing task would then need to wait for all those kernel tasks. I'm
    not sure how that affects scheduling fairness. And it's getting quite
    complicated.


    A kernel-provided second buffer could be entirely hidden behind the
    ptrace (or, rather, ds) interface. It would not even have to be drained
    before switching to the tracing task, since ds would just look into the
    real buffer and then move on to the second buffer - transparent to the
    user. Its size could be deducted from the user's memory limit and it
    could be in pageable memory.

    We would not be able to give precise overflow signals that way (the
    not-yet-drained real buffer might actually cause an overflow of the
    second buffer once drained). By allowing the user to query the
    number of BTS records to drain, we would not need to. A user drain
    would drain both buffers.

    The second buffer would be a pure performance/convenience detail of ds,
    just like you suggested.
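
    As a sketch of what "transparent to the user" means here, consider a
    user-space model in which the small real buffer spills into the larger
    second buffer, and reads and drains see one logical buffer. All names
    and sizes are made up for illustration; this is not the ds code:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Model: the small "real" (hardware) buffer spills into a larger
     * second buffer; a drain copies out the second buffer followed by
     * the not-yet-spilled real buffer - transparent to the caller. */
    #define REAL_SIZE   4
    #define SECOND_SIZE 16

    struct ds_context {
        long real[REAL_SIZE];     /* small hardware-written buffer */
        size_t real_n;
        long second[SECOND_SIZE]; /* larger kernel-managed buffer */
        size_t second_n;
    };

    /* Spill the real buffer into the second buffer (e.g. on overflow). */
    static void ds_spill(struct ds_context *ds)
    {
        for (size_t i = 0; i < ds->real_n && ds->second_n < SECOND_SIZE; i++)
            ds->second[ds->second_n++] = ds->real[i];
        ds->real_n = 0;
    }

    static void ds_record(struct ds_context *ds, long rec)
    {
        if (ds->real_n == REAL_SIZE)
            ds_spill(ds);
        ds->real[ds->real_n++] = rec;
    }

    /* Total records visible to the user: both buffers together. */
    static size_t ds_count(const struct ds_context *ds)
    {
        return ds->second_n + ds->real_n;
    }

    /* Drain both buffers, oldest-first; returns number of records. */
    static size_t ds_drain(struct ds_context *ds, long *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < ds->second_n; i++)
            out[n++] = ds->second[i];
        for (size_t i = 0; i < ds->real_n; i++)
            out[n++] = ds->real[i];
        ds->second_n = ds->real_n = 0;
        return n;
    }

    int main(void)
    {
        struct ds_context ds = { .real_n = 0, .second_n = 0 };
        long out[REAL_SIZE + SECOND_SIZE];

        for (long i = 0; i < 6; i++)  /* overflows the real buffer */
            ds_record(&ds, i);

        assert(ds_count(&ds) == 6);   /* one logical buffer */
        size_t n = ds_drain(&ds, out);
        assert(n == 6 && out[0] == 0 && out[5] == 5);
        assert(ds_count(&ds) == 0);   /* drain cleared both buffers */
        return 0;
    }
    ```

    The user never needs to know which records still sit in the real
    buffer and which have already been spilled.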


    The ptrace API would allow the user to:
    - define (and query) the overflow mechanism
    (wrap-around or event)
    - define (and query) the size of the buffer within certain limits
    (we could either give an error or cut off)
    - define (and query) events to be monitored
    (last branch trace, scheduling timestamps)
    - get a single BTS record
    - query the number of BTS records
    (to find out how big your drain buffer needs to be; it may be bigger
    than you requested)
    - drain all BTS records (copy, then clear)
    - clear all BTS records

    Draining would require the user to allocate a buffer to hold the data,
    which might not be feasible when he is near his memory limit. He could
    fall back to looping over the single-entry get. It is questionable how
    useful the drain ptrace command would actually be; we might want to
    replace it with a get-range command.
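
    To make the intended usage concrete, here is how a tracer might combine
    the size query with the single-entry-get fallback. The PTRACE_BTS_*
    request names are hypothetical (the ABI is only being proposed here),
    and xptrace() is a stub standing in for ptrace(2) against a traced task
    holding 5 records, so the sketch is self-contained:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical request numbers - not a real ABI. */
    enum {
        PTRACE_BTS_GET_SIZE = 40, /* query the number of BTS records */
        PTRACE_BTS_GET      = 41, /* get a single record by index */
        PTRACE_BTS_CLEAR    = 42, /* clear all records */
    };

    #define FAKE_RECORDS 5

    /* Stub for ptrace(2); pretends the traced task has 5 records. */
    static long xptrace(int req, int pid, long idx, long *out)
    {
        (void)pid;
        switch (req) {
        case PTRACE_BTS_GET_SIZE:
            return FAKE_RECORDS;
        case PTRACE_BTS_GET:
            if (idx < 0 || idx >= FAKE_RECORDS)
                return -1;
            *out = 100 + idx; /* pretend record content */
            return 0;
        default:
            return 0;
        }
    }

    int main(void)
    {
        int pid = 1234; /* would be the traced task */

        /* 1. Ask how big the drain buffer needs to be - it may be
         *    bigger than the size originally requested. */
        long n = xptrace(PTRACE_BTS_GET_SIZE, pid, 0, NULL);

        /* 2. Near the memory limit, fall back to looping over the
         *    single-entry get instead of one big drain. */
        long rec, sum = 0;
        for (long i = 0; i < n; i++) {
            if (xptrace(PTRACE_BTS_GET, pid, i, &rec) == 0)
                sum += rec;
        }
        assert(n == 5 && sum == 510);

        xptrace(PTRACE_BTS_CLEAR, pid, 0, NULL);
        printf("read %ld records one at a time\n", n);
        return 0;
    }
    ```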


    Are you OK with this?


    thanks and regards,
    markus.

  10. Re: x86, ptrace: support for branch trace store(BTS)


    * Metzger, Markus T wrote:

    > The ptrace API would allow the user to:
    > - define (and query) the overflow mechanism
    > (wrap-around or event)
    > - define (and query) the size of the buffer within certain limits
    > (we could either give an error or cut off)
    > - define (and query) events to be monitored
    > (last branch trace, scheduling timestamps)
    > - get a single BTS record
    > - query the number of BTS records
    > (to find out how big your drain buffer needs to be; it may be bigger
    > than you requested)
    > - drain all BTS records (copy, then clear)
    > - clear all BTS records
    >
    > Draining would require the user to allocate a buffer to hold the data,
    > which might not be feasible when he is near his memory limit. He could
    > fall back to looping over the single-entry get. It is questionable,
    > how useful the drain ptrace command would actually be; we might want
    > to replace it with a get range command.
    >
    > Are you OK with this?


    this sounds a lot more flexible to me. Please, once it looks good to all
    of us also extend LTP's ptrace bits with unit tests for these API
    additions. (Cc: such LTP bits to subrata@linux.vnet.ibm.com)

    Ingo

  11. RE: x86, ptrace: support for branch trace store(BTS)

    >-----Original Message-----
    >From: Ingo Molnar [mailto:mingo@elte.hu]
    >Sent: Donnerstag, 13. Dezember 2007 14:08


    >> Are you OK with this?
    >
    >this sounds a lot more flexible to me. Please, once it looks good to all


    I will implement the ptrace ABI and the man pages and send them out for
    review (I'll return -EOPNOTSUPP a few times) tomorrow.
    I'll add support for the overflow signal in a later patch, once I have
    figured out how to do it.


    >of us also extend LTP's ptrace bits with unit tests for these API
    >additions. (Cc: such LTP bits to subrata@linux.vnet.ibm.com)


    I need to look into this. This will take some time, as well.

    regards,
    markus.
