System call instrumentation - Kernel


Thread: System call instrumentation

  1. System call instrumentation

    Hi Ingo,

    I looked at the system call instrumentation present in LTTng lately. I
    tried different solutions, e.g. hooking a kernel-wide syscall trace in
    do_syscall_trace, but it appears that I ended up re-doing another
    syscall table, which consists of specialized functions that extract
    the string and data structure parameters from user-space. Since code
    duplication is not exactly wanted, I think that the original approach
    taken in my patchset, namely instrumenting the kernel code at the
    sys_* level (e.g. sys_open), which is the earliest level where the
    parameter information is made available to the kernel, is still the
    best way to go.

    I would still identify the execution mode changes in the same way I do
    currently, which is by instrumenting do_syscall_trace, just to know as
    soon as possible when the mode has changed from user-space to
    kernel-space, so we can do time accounting more accurately. I already
    have the patchset which adds the KERNEL_TRACE thread flag to every
    architecture. It's tested in assembly in the same way SYSCALL_TRACE is
    tested, but it is activated globally by iterating over all threads.

    So, the currently proposed scheme for a system call would be (for the
    open() example), shown as :

    kernel stack
    trace: event name (parameters)


    do_syscall_trace()
    trace: kernel_arch_syscall_entry (syscall id, instruction pointer)

    do_sys_open()
    trace: fs_open (fd, filename)

    do_syscall_trace()
    trace: kernel_arch_syscall_exit (return value)

    If we take this open() example, the filename is available only in
    do_sys_open(), which is called by both sys_open() and sys_openat(). So
    the logical instrumentation site for this would really be
    do_sys_open(). The information about which system call has been made
    is available in the kernel_arch_syscall_entry event. It is not present
    anymore at the do_sys_open() level, because this execution path can be
    reached from more than one syscall.
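
    For concreteness, the fs_open event above could be emitted with the
    2.6.25-era marker infrastructure roughly like this (a minimal sketch;
    the marker name, format string and placement are illustrative
    assumptions, not the actual LTTng patch):

    long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
    {
            char *tmp = getname(filename);
            int fd = PTR_ERR(tmp);

            if (!IS_ERR(tmp)) {
                    fd = get_unused_fd_flags(flags);
                    if (fd >= 0) {
                            struct file *f = do_filp_open(dfd, tmp, flags, mode);
                            if (IS_ERR(f)) {
                                    put_unused_fd(fd);
                                    fd = PTR_ERR(f);
                            } else {
                                    fsnotify_open(f->f_path.dentry);
                                    fd_install(fd, f);
                            }
                    }
                    /* tmp already holds the kernel-side copy of the string */
                    trace_mark(fs_open, "fd %d filename %s", fd, tmp);
                    putname(tmp);
            }
            return fd;
    }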

    What do you think of this approach ?


    Mathieu

    --
    Mathieu Desnoyers

  2. Re: System call instrumentation


    * Mathieu Desnoyers wrote:

    > Hi Ingo,
    >
    > I looked at the system call instrumentation present in LTTng lately. I
    > tried different solutions, e.g. hooking a kernel-wide syscall trace in
    > do_syscall_trace, but it appears that I ended up re-doing another
    > syscall table, which consists of specialized functions that extract
    > the string and data structure parameters from user-space. Since code
    > duplication is not exactly wanted, I think that the original approach
    > taken in my patchset, namely instrumenting the kernel code at the
    > sys_* level (e.g. sys_open), which is the earliest level where the
    > parameter information is made available to the kernel, is still the
    > best way to go.


    hm, i'm not sure about this. I've implemented system call tracing in -rt
    [embedded in the latency tracer] and it only needed changes in entry.S,
    not in every system call site. Now, granted, that tracer was simpler
    than what LTTng tries to do, but do we _really_ need more complexity? A
    trace point that simply expresses:

    sys_call_event(int sysno, long param1, long param2, long param3,
                   long param4, long param5, long param6);

    would be more than enough in 99% of the cases. If you want to extract
    the strings from the system call table, to make the filtering of these
    syscall events easier, do it at build time (for example by modifying
    the __SYSCALL() macros in unistd.h) instead of keeping a parallel
    syscall table.
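
    Along the lines of this sketch, say (hypothetical, not existing kernel
    code, and the header name varies per architecture) - redefining
    __SYSCALL() and re-including the syscall list yields a number-to-name
    table at compile time:

    /* build-time syscall name table (hypothetical sketch) */
    #undef __SYSCALL
    #define __SYSCALL(nr, sym) [nr] = #sym,

    static const char *syscall_names[] = {
    #include <asm/unistd.h>
    };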

    OTOH, as long as it's just one simple nonintrusive line per syscall,
    it's not a big deal - as long as it only traces the parameters as they
    come from the syscall ABI - we won't change them anyway. I.e. hide the
    ugly string portion and just turn them into a simple:

    trace_sys_getpid();
    trace_sys_time(tloc);
    trace_sys_gettimeofday(tz, tv);

    (although even such a solution would still need a general policy-level
    buy-in i guess - we don't want half of the syscalls traced and the
    other half objected to by maintainers. So it's either all or nothing.)

    and the question also arises: why not do this on a debuginfo level? This
    information can be extracted from the debuginfo. We could change
    'asmlinkage' to 'syscall_linkage' to make it clear which functions are
    syscalls.
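
    (One hypothetical shape for that annotation - not proposed code - keeps
    the calling convention and adds a dedicated ELF section, so that tools
    can enumerate syscall functions even without debuginfo:

    /* hypothetical: asmlinkage plus an identifiable section */
    #define syscall_linkage \
            asmlinkage __attribute__((__section__(".text.syscalls")))

    every sys_* function annotated this way then lands in one section.)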

    Ingo

  3. Re: System call instrumentation

    * Ingo Molnar (mingo@elte.hu) wrote:
    >
    > * Mathieu Desnoyers wrote:
    >
    > > Hi Ingo,
    > >
    > > I looked at the system call instrumentation present in LTTng lately. I
    > > tried different solutions, e.g. hooking a kernel-wide syscall trace in
    > > do_syscall_trace, but it appears that I ended up re-doing another
    > > syscall table, which consists of specialized functions that extract
    > > the string and data structure parameters from user-space. Since code
    > > duplication is not exactly wanted, I think that the original approach
    > > taken in my patchset, namely instrumenting the kernel code at the
    > > sys_* level (e.g. sys_open), which is the earliest level where the
    > > parameter information is made available to the kernel, is still the
    > > best way to go.

    >
    > hm, i'm not sure about this. I've implemented system call tracing in -rt
    > [embedded in the latency tracer] and it only needed changes in entry.S,
    > not in every system call site. Now, granted, that tracer was simpler
    > than what LTTng tries to do, but do we _really_ need more complexity? A
    > trace point that simply expresses:
    >
    > sys_call_event(int sysno, long param1, long param2, long param3,
    > long param4, long param5, long param6);
    >


    That would work for all system calls that don't have parameters like
    "const char __user *filename".

    > would be more than enough in 99% of the cases.


    Hrm, not sure about that :

    grep asmlinkage include/linux/syscalls.h | wc -l
    321

    grep -e 'asmlinkage.*__user' include/linux/syscalls.h | wc -l
    162

    So about 50% of system calls have to get pointers from userspace.


    > If you want to extract
    > the strings from the system call table, to make the filtering of these
    > syscall events easier, do it at build time (for example by modifying
    > the __SYSCALL() macros in unistd.h) instead of keeping a parallel
    > syscall table.
    >


    The goal of my second syscall table was to call callbacks to extract
    the __user parameters, nothing else, really. I currently extract the
    system call function names by iterating over the real system call
    table and by using kallsyms (it could be turned into a build-time
    extraction, since the system call table is static to the kernel core).
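
    (For illustration, that kallsyms-based extraction amounts to something
    like the following sketch - it assumes the tracer can see the
    architecture's sys_call_table[], which is not exported, and the real
    LTTng code differs in detail:)

    #include <linux/kallsyms.h>
    #include <linux/kernel.h>

    extern const void *sys_call_table[];  /* arch-provided; assumed visible */

    static void print_syscall_names(void)
    {
            char name[KSYM_NAME_LEN];
            int nr;

            for (nr = 0; nr < NR_syscalls; nr++) {
                    /* map each table slot's address back to a symbol name */
                    if (kallsyms_lookup((unsigned long)sys_call_table[nr],
                                        NULL, NULL, NULL, name))
                            printk(KERN_INFO "syscall %3d -> %s\n", nr, name);
            }
    }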

    > OTOH, as long as it's just one simple nonintrusive line per syscall,
    > it's not a big deal - as long as it only traces the parameters as they
    > come from the syscall ABI - we won't change them anyway. I.e. hide the
    > ugly string portion and just turn them into a simple:
    >
    > trace_sys_getpid();
    > trace_sys_time(tloc);
    > trace_sys_gettimeofday(tz, tv);
    >
    > (although even such a solution would still need a general policy-level
    > buy-in i guess - we don't want half of the syscalls traced and the
    > other half objected to by maintainers. So it's either all or nothing.)
    >


    Yup. That's ok with me.


    > and the question also arises: why not do this on a debuginfo level? This
    > information can be extracted from the debuginfo. We could change
    > 'asmlinkage' to 'syscall_linkage' to make it clear which functions are
    > syscalls.


    Given we already have the syscall table, I am not sure we would need any
    supplementary debuginfo support. Plus, the syscall table gives us the
    mapping between the syscall number and the symbol of the system call
    function, which is exactly what we need. The only thing I dislike in my
    current approach is that I depend on kallsyms; it would be good to have
    the system call symbols without depending on it.

    Mathieu

    >
    > Ingo


    --
    Mathieu Desnoyers

  4. Re: System call instrumentation


    * Mathieu Desnoyers wrote:

    > > hm, i'm not sure about this. I've implemented system call tracing in
    > > -rt [embedded in the latency tracer] and it only needed changes in
    > > entry.S, not in every system call site. Now, granted, that tracer
    > > was simpler than what LTTng tries to do, but do we _really_ need
    > > more complexity? A trace point that simply expresses:
    > >
    > > sys_call_event(int sysno, long param1, long param2, long param3,
    > > long param4, long param5, long param6);
    > >

    >
    > That would work for all system calls that don't have parameters like
    > "const char __user *filename".


    what's the problem with them? Do you want to follow those parameters?

    Ingo

  5. Re: System call instrumentation

    * Ingo Molnar (mingo@elte.hu) wrote:
    >
    > * Mathieu Desnoyers wrote:
    >
    > > > hm, i'm not sure about this. I've implemented system call tracing in
    > > > -rt [embedded in the latency tracer] and it only needed changes in
    > > > entry.S, not in every system call site. Now, granted, that tracer
    > > > was simpler than what LTTng tries to do, but do we _really_ need
    > > > more complexity? A trace point that simply expresses:
    > > >
    > > > sys_call_event(int sysno, long param1, long param2, long param3,
    > > > long param4, long param5, long param6);
    > > >

    > >
    > > That would work for all system calls that don't have parameters like
    > > "const char __user *filename".

    >
    > what's the problem with them? Do you want to follow those parameters?
    >
    > Ingo


    Ideally, I'd like to have this kind of high-level information :

    event name : kernel syscall
    syscall name : open
    arg1 (%s) : "somefile" <-----
    arg2 (%d) : flags
    arg3 (%d) : mode

    However, "somefile" has to be read from userspace. With the protection
    involved, it would cause a performance impact to read it a second time
    rather than tracing the string once it's been copied to kernel-space.

    Mathieu

    --
    Mathieu Desnoyers

  6. Re: System call instrumentation


    * Mathieu Desnoyers wrote:

    > Ideally, I'd like to have this kind of high-level information :
    >
    > event name : kernel syscall
    > syscall name : open
    > arg1 (%s) : "somefile" <-----
    > arg2 (%d) : flags
    > arg3 (%d) : mode
    >
    > However, "somefile" has to be read from userspace. With the protection
    > involved, it would cause a performance impact to read it a second time
    > rather than tracing the string once it's been copied to kernel-space.


    performance is a secondary issue here, and copies are fast anyway _if_
    someone wants to trace a syscall. (because the first copy brings the
    cacheline into the cache, subsequent copies are almost for free compared
    to the first copy)

    Ingo

  7. Re: System call instrumentation

    Hi Ingo and Mathieu,

    Ingo Molnar wrote:
    > * Mathieu Desnoyers wrote:
    >
    >> Ideally, I'd like to have this kind of high-level information :
    >>
    >> event name : kernel syscall
    >> syscall name : open
    >> arg1 (%s) : "somefile" <-----
    >> arg2 (%d) : flags
    >> arg3 (%d) : mode
    >>
    >> However, "somefile" has to be read from userspace. With the protection
    >> involved, it would cause a performance impact to read it a second time
    >> rather than tracing the string once it's been copied to kernel-space.

    >
    > performance is a secondary issue here, and copies are fast anyway _if_
    > someone wants to trace a syscall. (because the first copy brings the
    > cacheline into the cache, subsequent copies are almost for free compared
    > to the first copy)


    I think that the code duplication is also an issue.
    If we introduce functions which copy userspace strings the same way
    the original syscall functions do, we have to maintain similar code in
    two places. So I think Mathieu's approach (separating syscall
    parameters from the syscall-entry event) is reasonable enough.

    BTW, it seems that using the KERNEL_TRACE per-thread flag might be a
    bit tricky for tracing all threads. Out of curiosity :-), why didn't
    you use a global percpu flag to do it?
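
    (For reference, the two activation schemes being contrasted look
    roughly like this - a hypothetical sketch, not code from either
    patchset; TIF_KERNEL_TRACE is the flag name assumed here:)

    struct task_struct *g, *t;

    /* (a) per-thread flag: set it on every existing task; threads
     * created afterwards must inherit it too, which is the tricky part */
    read_lock(&tasklist_lock);
    do_each_thread(g, t) {
            set_tsk_thread_flag(t, TIF_KERNEL_TRACE);
    } while_each_thread(g, t);
    read_unlock(&tasklist_lock);

    /* (b) one global per-cpu flag, tested on each syscall entry */
    DEFINE_PER_CPU(int, syscall_trace_active);

    Scheme (a) reuses the existing thread-flag test in the assembly fast
    path; scheme (b) avoids walking the task list but adds a new test
    outside the thread flags.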

    Thanks,

    --
    Masami Hiramatsu

    Software Engineer
    Hitachi Computer Products (America) Inc.
    Software Solutions Division

    e-mail: mhiramat@redhat.com



  8. Re: System call instrumentation

    * Ingo Molnar (mingo@elte.hu) wrote:
    >
    > * Mathieu Desnoyers wrote:
    >
    > > Ideally, I'd like to have this kind of high-level information :
    > >
    > > event name : kernel syscall
    > > syscall name : open
    > > arg1 (%s) : "somefile" <-----
    > > arg2 (%d) : flags
    > > arg3 (%d) : mode
    > >
    > > However, "somefile" has to be read from userspace. With the protection
    > > involved, it would cause a performance impact to read it a second time
    > > rather than tracing the string once it's been copied to kernel-space.

    >
    > performance is a secondary issue here, and copies are fast anyway _if_
    > someone wants to trace a syscall. (because the first copy brings the
    > cacheline into the cache, subsequent copies are almost for free compared
    > to the first copy)
    >
    > Ingo


    Hrm, a quick benchmark on my pentium 4 comparing a normal open()
    system call executed in a loop to a modified open() syscall which
    executes the lines added in the following patch shows that it adds 450
    cycles to each open() system call. I added a putname/getname on
    purpose to see the cost of a second userspace copy, and it's not
    exactly free.

    The normal getname, correctly nested and re-using the string
    previously copied, should not suffer from that kind of performance
    hit. Also, given that the string would be copied from userspace only
    once, it would eliminate race scenarios where a multithreaded
    application could change the string underneath us, making the kernel
    trace a different string than the one really used for the system call.

    However, strings are not the only userspace arguments passed to system
    calls. For all the other arguments, performance could be an issue, as
    could racy user-level data modification, which would let the kernel
    trace a different parameter than the one actually used in the system
    call.

    For those two reasons, I think extracting these parameters could be
    faster/cleaner/safer if done in the system call function, where the
    parameters are already copied in kernel space.

    Mathieu


    Index: linux-2.6-lttng/fs/open.c
    ===================================================================
    --- linux-2.6-lttng.orig/fs/open.c	2008-05-19 22:51:16.000000000 -0400
    +++ linux-2.6-lttng/fs/open.c	2008-05-19 23:11:07.000000000 -0400
    @@ -1043,6 +1043,8 @@ long do_sys_open(int dfd, const char __u
     	int fd = PTR_ERR(tmp);
     
     	if (!IS_ERR(tmp)) {
    +		putname(tmp);
    +		tmp = getname(filename);
     		fd = get_unused_fd_flags(flags);
     		if (fd >= 0) {
     			struct file *f = do_filp_open(dfd, tmp, flags, mode);
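
    (The measurement loop was essentially of this shape - a hypothetical
    userspace sketch, not the exact harness; the path name and iteration
    count are made up, and the cycle counts come from rdtsc on the
    Pentium 4:)

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static inline uint64_t rdtsc(void)
    {
            uint32_t lo, hi;
            asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
            return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
            enum { ITERS = 100000 };
            uint64_t t0, t1;
            int i;

            t0 = rdtsc();
            for (i = 0; i < ITERS; i++)
                    close(open("/tmp/somefile", O_RDONLY));
            t1 = rdtsc();
            printf("%llu cycles per open()+close() pair\n",
                   (unsigned long long)((t1 - t0) / ITERS));
            return 0;
    }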


    --
    Mathieu Desnoyers

  9. Re: System call instrumentation

    On Mon, 19 May 2008 23:44:53 -0400
    Mathieu Desnoyers wrote:

    > * Ingo Molnar (mingo@elte.hu) wrote:
    > >
    > > * Mathieu Desnoyers wrote:
    > >
    > > > Ideally, I'd like to have this kind of high-level information :
    > > >
    > > > event name : kernel syscall
    > > > syscall name : open
    > > > arg1 (%s) : "somefile" <-----
    > > > arg2 (%d) : flags
    > > > arg3 (%d) : mode
    > > >
    > > > However, "somefile" has to be read from userspace. With the
    > > > protection involved, it would cause a performance impact to read
    > > > it a second time rather than tracing the string once it's been
    > > > copied to kernel-space.


    the audit subsystem already does all of this... why not use that??
    (And it goes to great lengths to do it securely.)

    > >


    > Hrm, a quick benchmark on my pentium 4 comparing a normal open()
    > system call executed in a loop to a modified open() syscall which
    > executes the lines added in the following patch shows that it adds
    > 450 cycles to each open() system call. I added a putname/getname on
    > purpose to see the cost of a second userspace copy, and it's not
    > exactly free.


    copying twice does mean that if the user wants, he can cheat you: he
    can, in another thread, change the string under you. So, say you're
    doing this for anti-virus purposes, he can make you scan one file and
    open another.


    The audit subsystem was carefully designed to avoid this trap... how
    about using that?
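
    (For context, getname() already hands the copied string to audit -
    slightly abridged from the era's fs/namei.c, shown here for
    illustration:)

    char *getname(const char __user *filename)
    {
            char *tmp, *result;

            result = ERR_PTR(-ENOMEM);
            tmp = __getname();
            if (tmp) {
                    int retval = do_getname(filename, tmp);

                    result = tmp;
                    if (retval < 0) {
                            __putname(tmp);
                            result = ERR_PTR(retval);
                    }
            }
            /* audit keeps this kernel-side copy until syscall exit, so
             * the checks and the actual open see the same string */
            audit_getname(result);
            return result;
    }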

  10. Re: System call instrumentation

    * Arjan van de Ven (arjan@infradead.org) wrote:
    > On Mon, 19 May 2008 23:44:53 -0400
    > Mathieu Desnoyers wrote:
    >
    > > * Ingo Molnar (mingo@elte.hu) wrote:
    > > >
    > > > * Mathieu Desnoyers wrote:
    > > >
    > > > > Ideally, I'd like to have this kind of high-level information :
    > > > >
    > > > > event name : kernel syscall
    > > > > syscall name : open
    > > > > arg1 (%s) : "somefile" <-----
    > > > > arg2 (%d) : flags
    > > > > arg3 (%d) : mode
    > > > >
    > > > > However, "somefile" has to be read from userspace. With the
    > > > > protection involved, it would cause a performance impact to read
    > > > > it a second time rather than tracing the string once it's been
    > > > > copied to kernel-space.

    >
    > the audit subsystem already does all of this... why not use that??
    > (And it goes to great lengths to do it securely.)
    >
    > > >

    >
    > > Hrm, a quick benchmark on my pentium 4 comparing a normal open()
    > > system call executed in a loop to a modified open() syscall which
    > > executes the lines added in the following patch shows that it adds
    > > 450 cycles to each open() system call. I added a putname/getname on
    > > purpose to see the cost of a second userspace copy, and it's not
    > > exactly free.

    >
    > copying twice does mean that if the user wants, he can cheat you. He
    > can, in another thread, change the string under you. So say you're
    > doing this for anti-virus purposes, he can make you scan one file and
    > open another.
    >
    >
    > The audit subsystem was carefully designed to avoid this trap... how
    > about using that?


    Hrm, given that tracing will have to grab __user * parameters passed
    to various system calls, not limited to strings, the getname/putname
    infrastructure would need to be expanded a lot. I doubt it's worth
    adding such complexity (copies to temporary memory buffers and
    reference counting) to those system calls just to support kernel-wide
    tracing.

    On the other hand, adding a marker in the traced function, at a code
    location where the data copied into the kernel is accessible, won't
    add such complexity and will help keep good locality of reference (the
    stack is meant to be a good cache-hot memory region). Because a
    dormant marker does not have a significant performance hit (actually,
    my benchmarks show a small acceleration of the overall system,
    probably due to cache-line code layout modifications), I think it's
    legitimate to add this kind of instrumentation in the existing kernel
    system call functions.
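
    (Why a dormant marker is cheap: when no probe is attached, each
    trace_mark() site reduces to a load of a per-marker state byte and a
    predicted-not-taken branch - a simplified sketch of the fast path,
    with the details of the real marker implementation elided:)

    /* roughly what each trace_mark(fs_open, ...) site expands to */
    if (unlikely(__mark_fs_open.state))     /* one byte load + branch */
            (*__mark_fs_open.call)(&__mark_fs_open, NULL,
                                   "fd %d filename %s", fd, tmp);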

    Mathieu

    --
    Mathieu Desnoyers
