[RFC] fix kallsyms to allow discrimination of local symbols - Kernel

This is a discussion on [RFC] fix kallsyms to allow discrimination of local symbols - Kernel ; The problem is that local symbols, being hidden from the linker, might not be unique. Thus, they don't make good anchors for the symbol relative addressing used by kprobes (it takes the first occurrence it finds). Likewise, when they appear ...

+ Reply to Thread
Results 1 to 11 of 11

Thread: [RFC] fix kallsyms to allow discrimination of local symbols

  1. [RFC] fix kallsyms to allow discrimination of local symbols

    The problem is that local symbols, being hidden from the linker, might
    not be unique. Thus, they don't make good anchors for the symbol
    relative addressing used by kprobes (it takes the first occurrence it
    finds). Likewise, when they appear in stack traces, it's sometimes not
    obvious which local symbol it is (although context usually allows an
    easy guess).

    Fix all of this by prefixing local symbols with the actual C file name
    they occur in separated by '|' (I had to use '|' since ':' is already in
    use for module prefixes in kallsyms lookups.

    I also had to rewrite mksysmap in perl because the necessary text
    formatting changes in shell are painfully slow.

    Comments?

    James

    ---

    diff --git a/Makefile b/Makefile
    index 6192922..a416b35 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -685,7 +685,7 @@ quiet_cmd_vmlinux_version = GEN .version

    # Generate System.map
    quiet_cmd_sysmap = SYSMAP
    - cmd_sysmap = $(CONFIG_SHELL) $(srctree)/scripts/mksysmap
    + cmd_sysmap = $(PERL) $(srctree)/scripts/mksysmap

    # Link of vmlinux
    # If CONFIG_KALLSYMS is set .version is already updated
    @@ -759,7 +759,7 @@ endef

    # Generate .S file with all kernel symbols
    quiet_cmd_kallsyms = KSYM $@
    - cmd_kallsyms = $(NM) -n $< | $(KALLSYMS) \
    + cmd_kallsyms = $(NM) -n -l $< | sed "s|`pwd`/||" | $(KALLSYMS) \
    $(if $(CONFIG_KALLSYMS_ALL),--all-symbols) > $@

    .tmp_kallsyms1.o .tmp_kallsyms2.o .tmp_kallsyms3.o: %.o: %.S scripts FORCE
    diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
    index ad2434b..0badae2 100644
    --- a/scripts/kallsyms.c
    +++ b/scripts/kallsyms.c
    @@ -63,11 +63,12 @@ static inline int is_arm_mapping_symbol(const char *str)

    static int read_symbol(FILE *in, struct sym_entry *s)
    {
    - char str[500];
    + char str[500], file[500];
    char *sym, stype;
    - int rc;
    + int rc, c, line;

    - rc = fscanf(in, "%llx %c %499s\n", &s->addr, &stype, str);
    + file[0] = '\0';
    + rc = fscanf(in, "%llx %c %499s", &s->addr, &stype, str);
    if (rc != 3) {
    if (rc != EOF) {
    /* skip line */
    @@ -75,6 +76,12 @@ static int read_symbol(FILE *in, struct sym_entry *s)
    }
    return -1;
    }
    + c = fgetc(in);
    + if (c != '\n') {
    + rc = fscanf(in, "%499[^:]:%d\n", file, &line);
    + if (rc != 2)
    + file[0] = '\0';
    + }

    sym = str;
    /* skip prefix char */
    @@ -115,13 +122,22 @@ static int read_symbol(FILE *in, struct sym_entry *s)
    /* include the type field in the symbol name, so that it gets
    * compressed together */
    s->len = strlen(str) + 1;
    + if (islower(stype))
    + s->len += strlen(file) + 1;
    s->sym = malloc(s->len + 1);
    if (!s->sym) {
    fprintf(stderr, "kallsyms failure: "
    "unable to allocate required amount of memory\n");
    exit(EXIT_FAILURE);
    }
    - strcpy((char *)s->sym + 1, str);
    + if (islower(stype)) {
    + char *ss = (char *)s->sym + 1;
    +
    + strcpy(ss, file);
    + strcat(ss, "|");
    + strcat(ss, str);
    + } else
    + strcpy((char *)s->sym + 1, str);
    s->sym[0] = stype;

    return 0;
    diff --git a/scripts/mksysmap b/scripts/mksysmap
    index 6e133a0..496cadd 100644
    --- a/scripts/mksysmap
    +++ b/scripts/mksysmap
    @@ -1,4 +1,4 @@
    -#!/bin/sh -x
    +#!/usr/bin/perl
    # Based on the vmlinux file create the System.map file
    # System.map is used by module-init tools and some debugging
    # tools to retrieve the actual addresses of symbols in the kernel.
    @@ -41,5 +41,21 @@
    # so we just ignore them to let readprofile continue to work.
    # (At least sparc64 has __crc_ in the middle).

    -$NM -n $1 | grep -v '\( [aNUw] \)\|\(__crc_\)\|\( \$[adt]\)' > $2
    +chomp($cwd = `pwd`);
    +open(I, "nm -n -l $ARGV[0]|") || die;
    +open(O, ">$ARGV[1]") || die;
    +foreach() {
    + chomp;
    + ($addr, $type, $symbol, $file_and_line) = split(/[ ]/, $_);
    + next if ($type =~ /[aNUw]/ || $type =~ /\$[adt]/);
    + next if ($symbol=~ /__crc_/);
    + if ($type =~ /[a-z]/ && $file_and_line) {
    + ($_) = split(/:/, $file_and_line);
    + (undef, $file) = split(/^$cwd\//, $_);
    + $symbol = $file."|".$symbol;
    + }
    + print O "$addr $type $symbol\n";
    +}
    +
    +



    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [RFC] fix kallsyms to allow discrimination of local symbols


    James Bottomley writes:

    > [...] Fix all of this by prefixing local symbols with the actual C
    > file name they occur in separated by '|' (I had to use '|' since ':'
    > is already in use for module prefixes in kallsyms lookups. [...]
    > Comments?


    Can we take some time to review how we got here?


    - You disprefer systemtap's use of an established, non-deprecated API
    for placing kernel probes. (We calculate addresses by a mixture of
    elf-analysis and runtime user-space lookup means. That's partly
    since kallsyms_lookup was unexported over our objections.) There is
    nothing outright broken (e.g. incorrect numbers) with what systemtap
    has been doing for years.

    - You argue that symbols+offset kprobing is better. We can see that,
    in some sense, but ...

    - I explain that we are used to final address calculating, as we'll
    have to do that regardless for user-space probes. Plus we need to
    work with kernels that predate the symbol+offset kprobe api
    extension. So this change would not simplify systemtap in any way.
    You do not respond.

    - I offer _stext+offset (for the kernel) and (.text*)+offset (for
    modules) kprobes: basically to use the "better" symbol+offset
    kprobes api, but use the same single reference addresses we already
    do, and leaving just the final addition to the kernel. You do not
    respond materially.

    - You argue that it cannot only be any symbol+offset ... but the actual
    nearest symbol+offset. But that doesn't work for local symbols. So
    you fix that to the nearest globally visible symbol+offset. But this
    requires:
    - yet more extra work and code from systemtap
    - extension to the kernel build system, and kallsyms runtime data to
    fix the current local-symbol-ambiguity problem
    - storage of all that new file name data in permanent unswappable
    kernel data (>>100kB, if done simply prefixing local symbol names
    file file names).
    - possible further complications related to filename string matching

    - You have yet to invent a scheme to allow offloading *data* address
    calculations to the kernel. Without that (and perhaps more),
    systemtap will *still* have to fetch same base _stext etc.
    addresses at run time that it currently does -- even if it did not
    use them to compute kprobes addresses.


    In total, this path would end up with both systemtap and the kernel
    more complex, larger and a bit slower too. Does that still seem an
    acceptable cost, just to get systemtap to change its preferred kprobes
    api?


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    On Mon, 2008-07-21 at 21:44 -0400, Frank Ch. Eigler wrote:
    > James Bottomley writes:
    >
    > > [...] Fix all of this by prefixing local symbols with the actual C
    > > file name they occur in separated by '|' (I had to use '|' since ':'
    > > is already in use for module prefixes in kallsyms lookups. [...]
    > > Comments?

    >
    > Can we take some time to review how we got here?
    >
    >
    > - You disprefer systemtap's use of an established, non-deprecated API
    > for placing kernel probes. (We calculate addresses by a mixture of
    > elf-analysis and runtime user-space lookup means. That's partly
    > since kallsyms_lookup was unexported over our objections.) There is
    > nothing outright broken (e.g. incorrect numbers) with what systemtap
    > has been doing for years.


    You mean embedding half a megabyte of symbols simply so you can avoid
    the inconvenience of using a kernel API? yes, I think it's ...
    suboptimal.

    > - You argue that symbols+offset kprobing is better. We can see that,
    > in some sense, but ...
    >
    > - I explain that we are used to final address calculating, as we'll
    > have to do that regardless for user-space probes. Plus we need to
    > work with kernels that predate the symbol+offset kprobe api
    > extension. So this change would not simplify systemtap in any way.
    > You do not respond.


    There is no current userspace infrastructure, since utrace still isn't
    in the kernel, so you're predicating this argument on an event which
    hasn't happened.

    Even assuming utrace is accepted, embedding the symbol table of every
    user space process in the probes is still daft. It's this constant
    assertion that "it must be done my way" that's causing such a drag on
    the open source process. For instance, the obvious way to me of doing
    this would be to map the user space stack into the systemtap runtime and
    unwind it from there instead of vectoring it into the kernel.

    > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
    > modules) kprobes: basically to use the "better" symbol+offset
    > kprobes api, but use the same single reference addresses we already
    > do, and leaving just the final addition to the kernel. You do not
    > respond materially.


    I thought this and subsequent emails addressed the points pretty well:

    http://marc.info/?l=linux-kernel&m=121632572409118

    > - You argue that it cannot only be any symbol+offset ... but the actual
    > nearest symbol+offset. But that doesn't work for local symbols. So
    > you fix that to the nearest globally visible symbol+offset. But this
    > requires:
    > - yet more extra work and code from systemtap


    I'm afraid that's how open source development works ... you iterate to
    find the best solution

    > - extension to the kernel build system, and kallsyms runtime data to
    > fix the current local-symbol-ambiguity problem


    Finding weaknesses in APIs and fixing them is what it's all about.

    > - storage of all that new file name data in permanent unswappable
    > kernel data (>>100kB, if done simply prefixing local symbol names
    > file file names).


    I'd check my facts before making assertions. The kernel symbol table is
    stored in a compressed form that actually eliminates most of these
    repetitions.

    > - possible further complications related to filename string matching


    Any substantiation of that?

    > - You have yet to invent a scheme to allow offloading *data* address
    > calculations to the kernel. Without that (and perhaps more),
    > systemtap will *still* have to fetch same base _stext etc.
    > addresses at run time that it currently does -- even if it did not
    > use them to compute kprobes addresses.


    That would be because I haven't actually started looking at this one
    yet. Of course, that would make it a great starting point for others
    who wished to help.

    > In total, this path would end up with both systemtap and the kernel
    > more complex, larger and a bit slower too.


    Really? I count the reduction of the probe modules from 500kb to 50kb a
    worthwhile saving. I don't even see where anything became larger.

    > Does that still seem an
    > acceptable cost, just to get systemtap to change its preferred kprobes
    > api?


    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    Hi -

    On Mon, Jul 21, 2008 at 10:53:08PM -0500, James Bottomley wrote:
    > [...]
    > > - You disprefer systemtap's use of an established, non-deprecated API
    > > for placing kernel probes. [...]

    >
    > You mean embedding half a megabyte of symbols simply so you can avoid
    > the inconvenience of using a kernel API? yes, I think it's ...
    > suboptimal.


    It has been explained already that the symbol table you saw in
    stap-symbols.h has nothing to do with the kprobe addressing issue.

    http://www.gossamer-threads.com/list.../947365#957365


    > > - You argue that symbols+offset kprobing is better. We can see that,
    > > in some sense, but ...
    > >
    > > - I explain that we are used to final address calculating, as we'll
    > > have to do that regardless for user-space probes. Plus we need to
    > > work with kernels that predate the symbol+offset kprobe api
    > > extension. So this change would not simplify systemtap in any way.
    > > You do not respond.

    >
    > There is no current userspace infrastructure, since utrace still isn't
    > in the kernel, so you're predicating this argument on an event which
    > hasn't happened.


    We exercise professional foresight. And the backward compatibility
    issue remains even without that.


    > Even assuming utrace is accepted, embedding the symbol table of
    > every user space process in the probes is still daft. [....]


    It would take space, no question, though we're not talking about
    "every" process, just designated ones.

    > For instance, the obvious way to me of doing this would be to map
    > the user space stack into the systemtap runtime and unwind it from
    > there instead of vectoring it into the kernel.


    Please elaborate. What does mapping a stack into the runtime mean?
    Do you mean to suggest having the userspace program unwind itself? Or
    relying on the userspace programs' possibly-paged-out unwind data?
    That would be intrusive.


    > > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
    > > modules) kprobes: basically to use the "better" symbol+offset
    > > kprobes api, but use the same single reference addresses we already
    > > do, and leaving just the final addition to the kernel. You do not
    > > respond materially.

    >
    > I thought this and subsequent emails addressed the points pretty well:
    >
    > http://marc.info/?l=linux-kernel&m=121632572409118


    No, they didn't. Every time I explained about how it does work, you
    just claimed "not", without even a single worked-out substantiating
    example.


    > [...]
    > > - storage of all that new file name data in permanent unswappable
    > > kernel data (>>100kB, if done simply prefixing local symbol names
    > > file file names).

    >
    > I'd check my facts before making assertions. The kernel symbol table is
    > stored in a compressed form that actually eliminates most of these
    > repetitions.


    A careful reader will notice the "if" in my sentence. Anyway, that
    or a superior compression scheme could apply to systemtap's various
    tables too.


    > > - possible further complications related to filename string matching

    >
    > Any substantiation of that?


    We have had reported problems with differences between kernels
    hand-built with long absolute source path names versus the smallest
    "kernel/foo.c" names. If such canonicalization takes place but
    inconsistently by the different tools, we will have a problem.


    > [...]
    > > In total, this path would end up with both systemtap and the kernel
    > > more complex, larger and a bit slower too.

    >
    > Really? I count the reduction of the probe modules from 500kb to 50kb a
    > worthwhile saving.


    The red clupea harengus again.

    > I don't even see where anything became larger.


    Even with ksymtab compression, there is still new data to be stored in
    the kernel, and it is extra for each systemtap probe datum.


    > > Does that still seem an acceptable cost, just to get systemtap to
    > > change its preferred kprobes api?


    > [no answer]


    Indeed.


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    On Tue, 2008-07-22 at 07:51 -0400, Frank Ch. Eigler wrote:
    > Hi -
    >
    > On Mon, Jul 21, 2008 at 10:53:08PM -0500, James Bottomley wrote:
    > > [...]
    > > > - You disprefer systemtap's use of an established, non-deprecated API
    > > > for placing kernel probes. [...]

    > >
    > > You mean embedding half a megabyte of symbols simply so you can avoid
    > > the inconvenience of using a kernel API? yes, I think it's ...
    > > suboptimal.

    >
    > It has been explained already that the symbol table you saw in
    > stap-symbols.h has nothing to do with the kprobe addressing issue.
    >
    > http://www.gossamer-threads.com/list.../947365#957365


    You're confusing issues. I said embedding half a megabyte of symbol
    table that the kernel already has is a bad idea full stop. The ultimate
    think I'm looking to do is to evolve kernel APIs that makes this
    practice unnecessary.

    > > > - You argue that symbols+offset kprobing is better. We can see that,
    > > > in some sense, but ...
    > > >
    > > > - I explain that we are used to final address calculating, as we'll
    > > > have to do that regardless for user-space probes. Plus we need to
    > > > work with kernels that predate the symbol+offset kprobe api
    > > > extension. So this change would not simplify systemtap in any way.
    > > > You do not respond.

    > >
    > > There is no current userspace infrastructure, since utrace still isn't
    > > in the kernel, so you're predicating this argument on an event which
    > > hasn't happened.

    >
    > We exercise professional foresight. And the backward compatibility
    > issue remains even without that.


    No ... you're trying to constrain the open source process to a
    pre-conceived design which is unrealised by in-kernel code. This is
    directly producing an impasse.

    Backwards compatibility is the province of distros. The job of upstream
    is to produce the best tool and environment we can. After this, the
    backwards portability of desirable pieces can be discussed.

    > > Even assuming utrace is accepted, embedding the symbol table of
    > > every user space process in the probes is still daft. [....]

    >
    > It would take space, no question, though we're not talking about
    > "every" process, just designated ones.
    >
    > > For instance, the obvious way to me of doing this would be to map
    > > the user space stack into the systemtap runtime and unwind it from
    > > there instead of vectoring it into the kernel.

    >
    > Please elaborate. What does mapping a stack into the runtime mean?


    It means that the systemtap runtime and the process would share a
    mapping for the process stacks Obviously the process would have to be
    quiesced to poke about in it, but it obviates the need to vector
    megabytes of stack information through the kernel.

    > Do you mean to suggest having the userspace program unwind itself?


    It's possible ... but more likely that the stap runtime would do the
    unwinding. Which is more efficient won't really be known until someone
    actually tries coding it.

    > Or relying on the userspace programs' possibly-paged-out unwind data?
    > That would be intrusive.


    I think you'll find doing it in user space is an advantage for paged out
    data. It's much more complex to get to it in the kernel because you
    have to be careful of context while you're asking for it to be paged
    back in.

    > > > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
    > > > modules) kprobes: basically to use the "better" symbol+offset
    > > > kprobes api, but use the same single reference addresses we already
    > > > do, and leaving just the final addition to the kernel. You do not
    > > > respond materially.

    > >
    > > I thought this and subsequent emails addressed the points pretty well:
    > >
    > > http://marc.info/?l=linux-kernel&m=121632572409118

    >
    > No, they didn't. Every time I explained about how it does work, you
    > just claimed "not", without even a single worked-out substantiating
    > example.


    Really? The mutability of _stext vs _text; the problem probing init
    sections I think they're real issues.

    > > [...]
    > > > - storage of all that new file name data in permanent unswappable
    > > > kernel data (>>100kB, if done simply prefixing local symbol names
    > > > file file names).

    > >
    > > I'd check my facts before making assertions. The kernel symbol table is
    > > stored in a compressed form that actually eliminates most of these
    > > repetitions.

    >
    > A careful reader will notice the "if" in my sentence. Anyway, that
    > or a superior compression scheme could apply to systemtap's various
    > tables too.
    >
    >
    > > > - possible further complications related to filename string matching

    > >
    > > Any substantiation of that?

    >
    > We have had reported problems with differences between kernels
    > hand-built with long absolute source path names versus the smallest
    > "kernel/foo.c" names. If such canonicalization takes place but
    > inconsistently by the different tools, we will have a problem.


    What it currently does is add tree relative names, so, for example this
    is a cut and paste from System.map:

    c0100000 T _text
    c0100000 T startup_32
    c0100054 t arch/x86/kernel/head_32.S|default_entry
    c01000a8 T startup_32_smp
    c010012a t arch/x86/kernel/head_32.S|checkCPUtype
    c01001ab t arch/x86/kernel/head_32.S|is486
    c01001b2 t arch/x86/kernel/head_32.S|is386
    c0100220 T initial_code
    c0100224 t arch/x86/kernel/head_32.S|check_x87

    There's no ambiguity about how the path is constructed.

    > > [...]
    > > > In total, this path would end up with both systemtap and the kernel
    > > > more complex, larger and a bit slower too.

    > >
    > > Really? I count the reduction of the probe modules from 500kb to 50kb a
    > > worthwhile saving.

    >
    > The red clupea harengus again.


    only if alio populo leges tantum adhiberent

    > > I don't even see where anything became larger.

    >
    > Even with ksymtab compression, there is still new data to be stored in
    > the kernel, and it is extra for each systemtap probe datum.


    I think the requirement that your huge data problem be solved at
    absolutely no cost is possibly a bit ambitious. However, it can be done
    at little cost to the kernel.

    > > > Does that still seem an acceptable cost, just to get systemtap to
    > > > change its preferred kprobes api?

    >
    > > [no answer]

    >
    > Indeed.


    Perhaps because that's the wrong question. It's not about trying to
    change a "preferred" API (and the concept of preferred means according
    to your preset design). It's about trying to get systemtap actually to
    engage with the kernel and iterate to an actual solution.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [RFC] fix kallsyms to allow discrimination of local symbols


    James Bottomley writes:

    > [...]
    >> > [...]
    >> > > - You disprefer systemtap's use of an established, non-deprecated API
    >> > > for placing kernel probes. [...]
    >> >
    >> > You mean embedding half a megabyte of symbols simply so you can avoid
    >> > the inconvenience of using a kernel API? yes, I think it's ...
    >> > suboptimal.

    >>
    >> It has been explained already that the symbol table you saw in
    >> stap-symbols.h has nothing to do with the kprobe addressing issue.
    >> [...]

    >
    > You're confusing issues. I said embedding half a megabyte of symbol
    > table that the kernel already has is a bad idea full stop.


    Be that as it may, but it is not appropriate to knowingly bring up a
    topic that is irrelevant to the patch series you're actually proposing.


    > The ultimate think I'm looking to do is to evolve kernel APIs that
    > makes this practice unnecessary.


    If by "this practice" you mean "stap-symbols.h tables", then you're
    worrying on the wrong area of code (anything in/near kprobes). I am
    happy to suggest more appropriate areas to help with that.


    >> > There is no current userspace infrastructure, since utrace still isn't
    >> > in the kernel, so you're predicating this argument on an event which
    >> > hasn't happened.

    >>
    >> We exercise professional foresight. And the backward compatibility
    >> issue remains even without that.

    >
    > No ... you're trying to constrain the open source process
    > to a pre-conceived design which is unrealised by in-kernel code.


    The "open source process" does not entail welcoming unnecessary changes.

    > This is directly producing an impasse.


    I don't recognize an impasse. This aspect of systemtap has been
    working just fine with the kernel as has been. We are not blocked on
    a resolution to the current issue.


    >> [...]
    >> > For instance, the obvious way to me of doing this would be to map
    >> > the user space stack into the systemtap runtime and unwind it from
    >> > there instead of vectoring it into the kernel.

    >>
    >> Please elaborate. What does mapping a stack into the runtime mean?

    >
    > It means that the systemtap runtime and the process would share a
    > mapping for the process stacks Obviously the process would have to be
    > quiesced to poke about in it, but it obviates the need to vector
    > megabytes of stack information through the kernel.


    I still don't understand. No one has seriously proposed "vectoring
    megabytes of stack information through the kernel", if by that you
    mean copying it all to userspace. Even frame-pointer-based dtrace
    doesn't do that.

    Accessing the process stack from systemtap modules is not the problem
    in the first place: rather, it's instant access to the unwind & symbol
    data needed to decode it.


    >> Do you mean to suggest having the userspace program unwind itself?

    >
    > It's possible ... but more likely that the stap runtime would do the
    > unwinding. Which is more efficient won't really be known until someone
    > actually tries coding it.


    It's not just a matter of efficiency but a matter of non-disruptiveness.


    >> Or relying on the userspace programs' possibly-paged-out unwind data?
    >> That would be intrusive.

    >
    > I think you'll find doing it in user space is an advantage for paged out
    > data. It's much more complex to get to it in the kernel because you
    > have to be careful of context while you're asking for it to be paged
    > back in.


    We would not ask for anything to be paged back in. That is one kind
    of disruption to system state that we strain to avoid.


    >> > > - I offer _stext+offset (for the kernel) and (.text*)+offset (for
    >> > > modules) kprobes [...]
    >> > I thought this and subsequent emails addressed the points pretty well:
    >> > http://marc.info/?l=linux-kernel&m=121632572409118

    >>
    >> No, they didn't. Every time I explained about how it does work, you
    >> just claimed "not", without even a single worked-out substantiating
    >> example.

    >
    > Really? The mutability of _stext vs _text; the problem probing init
    > sections I think they're real issues.


    Then please do me the courtesy of replying to the points already made
    on those items.

    (Recap: we don't care whether the vmlinux reference symbol is _text or
    _stext or something else. Whatever works, and so far _stext does
    everywhere we've needed it.)

    For init sections in general, they are irrelevant for systemtap at
    this time as kprobes blocks them, and we'd need extra kernel/module.c
    infrastructure to let us hook module-loaded init-not-yet-run events
    (and the corresponding exit transitions). Once those are solved,
    systemtap can trivially calculate .init addresses relative to _stext
    or whatever -- and it does today!)

    So, they do not appear to be real issues.


    >[...]
    >> We have had reported problems with differences between kernels
    >> hand-built with long absolute source path names versus the smallest
    >> "kernel/foo.c" names. If such canonicalization takes place but
    >> inconsistently by the different tools, we will have a problem.

    >
    > What it currently does is add tree relative names, so, for example this
    > is a cut and paste from System.map: [...]
    > There's no ambiguity about how the path is constructed.


    Perhaps, as long as the dwarf-parsing side canonicalizes them the same way.


    >> Even with ksymtab compression, there is still new data to be stored in
    >> the kernel, and it is extra for each systemtap probe datum.

    >
    > I think the requirement that your huge data problem be solved at
    > absolutely no cost is possibly a bit ambitious.


    What "huge data problem"? Forget the stap-symbols.h already.


    >> > > Does that still seem an acceptable cost, just to get systemtap to
    >> > > change its preferred kprobes api?
    >> > [no answer]

    >> Indeed.

    >
    > Perhaps because that's the wrong question. It's not about trying to
    > change a "preferred" API (and the concept of preferred means according
    > to your preset design).


    Preferred, as in working, stable, not in any way deprecated, backward
    compatible -- and more. It ain't broke and it don't need fixin'.


    > It's about trying to get systemtap actually to engage with the
    > kernel and iterate to an actual solution.


    It is unfortunate that this technical trivium wants to turn into jabs
    about "engaging with the kernel" or "constraining the open source
    process" or something. Let the idea stand on its own merits.


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    On Tue, Jul 22, 2008 at 12:05:23PM -0400, Frank Ch. Eigler wrote:
    > > You're confusing issues. I said embedding half a megabyte of symbol
    > > table that the kernel already has is a bad idea full stop.

    >
    > Be that as it may, but it is not appropriate to knowingly bring up a
    > topic that is irrelevant to the patch series you're actually proposing.
    >
    > > The ultimate think I'm looking to do is to evolve kernel APIs that
    > > makes this practice unnecessary.

    >
    > If by "this practice" you mean "stap-symbols.h tables", then you're
    > worrying on the wrong area of code (anything in/near kprobes). I am
    > happy to suggest more appropriate areas to help with that.


    I think there is a certain amount of talking past each other which is
    going on here.

    Let me see if I can summarize what is going on. My understanding of
    the situation is as follows:

    1) Right now, when trying to resolve the address to install a kprobe,
    Systemtap is using the DWARF information from the vmlinux file and/or
    the module's debuginfo files.

    2) Kprobe can support setting probes based on raw addresses (which is
    what Systemtap is doing now) as well as a text symbol looked up from
    kallsyms plus an offset. The latter was introduced two years ago, but
    Systemtap does not take advantage of it.

    3) James is offended by the fact that Systemtap is not utilizing this
    interface, and has offerred up patches which adds the capability of
    using the symbol+offset feature of the kprobes interface. There seems
    to be some question as to whether his patch has all of the
    functionality of the existing code which determines addresses via
    pawing through DWARF headers (especially where a function may have
    been inlined one or more times) but presumably James' patches could be
    enhanced to deal with this issues, if he hasn't done so already.

    Am I missing anything?

    So the main arguments against this seems to be that it by itself this
    doesn't actually reduce any complexity or code in Systemtap because
    there are other places (kernel data segment, user space tracing if it
    ever manages to get merged into mainline, etc.) which needs to be able
    to paw through DWARF headers.

    James' argument that this Systemtap is cramming over half a megabyte
    is regarded by is somewhat of a red herring with respect to this
    specific patch, since systemtap does not calculate probe addresses at
    runtime. This 600k or so of symbol tables is being used for something
    else. (What, exactly?!?) James clearly in the long term wants to
    make this go away, which seems reasonable. He views changing how
    kprobes are placed is the first in a series of changes that he would
    like to make. So perhaps the resistance in accepting his first patch
    troubles him since it appears that Systemtap folks aren't willing to
    take his suggestions about how things should be improved.

    May I make a suggestion and not try to come to a conclusion about the
    big picture question for a moment and focus on the very short-term of
    whether it is better that when I implement a probe such as:

    probe module("ext4dev").function("ext4_fill_super")
    {
    printk("here am I!\n");
    }

    This should be done via passing a hard-coded address, or via using the
    kprobes function+offset facility? It would seem that there are
    advantages to James' patch all by itself, in that it will will work
    even if the debuginfo information for the ext4dev module can't be
    found, since the kallsyms information would be used instead. It also
    means that the resulting systemtap probe modules will be easier to
    make more kernel independent, since it won't be using a hard-coded
    address, but rather a symbolic name.

    So what is the good reason *not* to do things the way James has
    suggested? The kprobes kallsyms facility has been around for a while
    now. Is it that we need to make changes?

    Maybe if things are focused on somewhat more concrete questions, we
    can make progress.

    Regards,

    - Ted
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    Hi, Ted -


    On Tue, Jul 22, 2008 at 09:48:04PM -0400, Theodore Tso wrote:

    > [...]
    > 1) Right now, when trying to resolve the address to install a kprobe,
    > Systemtap is using the DWARF information from the vmlinux file and/or
    > the module's debuginfo files.


    Or as a fallback, it can use textual symbol tables, such as a
    System.map or /proc/kallsyms dump.

    > 2) Kprobe can support setting probes based on raw addresses (which is
    > what Systemtap is doing now) as well as a text symbol looked up from
    > kallsyms plus an offset. The latter was introduced two years ago, but
    > Systemtap does not take advantage of it.


    Right.

    > 3) James is offended by the fact that Systemtap is not utilizing this
    > interface, and has offerred up patches which adds the capability of
    > using the symbol+offset feature of the kprobes interface. [...]
    >
    > Am I missing anything?


    I also proposed a compromise where systemtap would use the
    symbol+offset interface, but choose a single convenient symbol as base
    for all probes in a particular elf file (/section).


    > So the main arguments against this seems to be that it by itself
    > this doesn't actually reduce any complexity or code in Systemtap
    > because there are other places (kernel data segment, user space
    > tracing if it ever manages to get merged into mainline, etc.) which
    > needs to be able to paw through DWARF headers.


    Right.


    > James' argument that this Systemtap is cramming over half a megabyte
    > is regarded by is somewhat of a red herring with respect to this
    > specific patch, since systemtap does not calculate probe addresses
    > at runtime.


    Or more precisely, not from that table.

    > This 600k or so of symbol tables is being used for something else.
    > (What, exactly?!?)


    They are there to enable scripts to perform address-to-symbol and
    symbol-to-address mappings on demand. This comes up in several
    contexts, one of which is symbolic stack unwinding, another is a set
    of explicit utility functions that are/will be available.

    It is possible that through analysis of any particular script's
    contents, systemtap will be able to determine that this data is
    unused, and thus compile it out.


    > James clearly in the long term wants to make this go away, which
    > seems reasonable.


    Right, though this is an active development area whose ultimate
    costs/benefits we do not yet know.


    > He views changing how kprobes are placed is the first in a series of
    > changes that he would like to make.


    Right, keeping in mind that this change does not advance that prior
    goal.


    > So perhaps the resistance in accepting his first patch troubles him
    > since it appears that Systemtap folks aren't willing to take his
    > suggestions about how things should be improved.


    Actually, this is not James' first patch; I am grateful for several
    that are in systemtap already and others are in the queue. But point
    taken.


    > May I make a suggestion and not try to come to a conclusion about the
    > big picture question for a moment and focus on the very short-term of
    > whether it is better that when I implement a probe such as:
    >
    > probe module("ext4dev").function("ext4_fill_super")
    > {
    > printk("here am I!\n");
    > }


    Sure.


    > This should be done via passing a hard-coded address, or via using
    > the kprobes function+offset facility? It would seem that there are
    > advantages to James' patch all by itself, in that it will will work
    > even if the debuginfo information for the ext4dev module can't be
    > found, since the kallsyms information would be used instead.


    As a quality-of-implementation matter, systemtap checks at translation
    time that such probes make sense -- that "ext4_fill_super" even
    exists. (That is needed also to expand wildcards.) The only way it
    can do that is if it has dwarf or separate textual symbol table data
    (see above). Both of those carry addresses as well, so we might as
    well use them.


    > It also means that the resulting systemtap probe modules will be
    > easier to make more kernel independent, since it won't be using a
    > hard-coded address, but rather a symbolic name.


    This is true, but I think it would apply to only a small class of
    scripts. Such a script would have to access no $context variables
    (since those require dwarf location/type data), and cannot include
    statement probes at function interiors (since those would require
    dwarf-derived non-zero symbol offsets), and cannot use any other
    runtime facility that involves arbitrary symbol<->address lookups.

    Even then, kernel-independence in the sense of being able to copy and
    reuse a compiled probe module amongst separate kernel builds seems
    like a faint hope anyway, considering modversions and our own internal
    version-matching checks. But maybe there is an opportunity here.


    > So what is the good reason *not* to do things the way James has
    > suggested? The kprobes kallsyms facility has been around for a
    > while now. Is it that we need to make changes?


    One reason is that James' proposed code breaks systemtap for
    pre-kallsyms-kprobes kernels, and those kernels too that have
    kallsyms-kprobes but not the RFC'd new one that has the source file
    names encoded within them. It could instead use e.g. our autoconf*
    facility to generate code compatible with them all.

    Another reason is that it likely breaks systemtap for impending code
    for user-space by replacing rather than extending various internal
    data structures that deal with this.

    I have seen little sympathy expressed for either of these concerns,
    which means that the new code would primarily allay some offense (but
    not constitute a bug fix or "usability for kernel developers" matter),
    and leave us to pick up the pieces. We can't make a habit out of that
    sort of thing, but maybe as a one-off in the interest of mutual
    goodwill, we should work out a way to get it done.


    > Maybe if things are focused on somewhat more concrete questions, we
    > can make progress.


    Thank you.


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    On Wed, Jul 23, 2008 at 12:16:12AM -0400, Frank Ch. Eigler wrote:
    > I also proposed a compromise where systemtap would use the
    > symbol+offset interface, but choose a single convenient symbol as base
    > for all probes in a particular elf file (/section).


    I guess I don't see the value of that over just using the address
    directly. James' point wasn't just to use the symbol+offset feature
    just for the sake of using it, but rather as a better way of
    specifying how to insert a probe into a kernel. For example, it may
    be that by allowing the kernel to have a bit more semantic knowledge
    of where a probe is going, it could more easily do various
    safety-related checks that can't be done if all it is given is a a raw
    address.

    > > probe module("ext4dev").function("ext4_fill_super")
    > > {
    > > printk("here am I!\n");
    > > }

    >
    > > This should be done via passing a hard-coded address, or via using
    > > the kprobes function+offset facility? It would seem that there are
    > > advantages to James' patch all by itself, in that it will will work
    > > even if the debuginfo information for the ext4dev module can't be
    > > found, since the kallsyms information would be used instead.

    >
    > As a quality-of-implementation matter, systemtap checks at translation
    > time that such probes make sense -- that "ext4_fill_super" even
    > exists. (That is needed also to expand wildcards.) The only way it
    > can do that is if it has dwarf or separate textual symbol table data
    > (see above). Both of those carry addresses as well, so we might as
    > well use them.


    True, though I'll note for modern kernels, with /proc/kallsyms, we
    should hopefully be able to do this (offset=0 probes) without DWARF
    headers. BTW, one of the things which I have wondered is whether
    DWARF was really the right approach after all, given how bloated and
    space-inefficient it seems to be. Needing to keep 600 megs of module
    debuginfo for each kernel that I build (and yes, I build a fairly
    complete module-rich kernel, but having 59 megs of modules expand into
    606 megs of debuginfo files is more than a little bit obnoxious.)

    Maybe the kernel could be one of the places where we experiment with
    using something like CTF. Apparently CTF is compact enough that the
    Solaris folks are willing to keep the CTF information for the
    currently booted kernel in unswappable kernel memory --- which I know
    is one of those things we wouldn't want to do with DWARF information!

    > I have seen little sympathy expressed for either of these concerns,
    > which means that the new code would primarily allay some offense (but
    > not constitute a bug fix or "usability for kernel developers" matter),
    > and leave us to pick up the pieces. We can't make a habit out of that
    > sort of thing, but maybe as a one-off in the interest of mutual
    > goodwill, we should work out a way to get it done.


    This may have been a communication disconnect. I suspect if the
    answer was, "please make the following changes and we will accept the
    patch", that James may have been willing to make the changes ---
    especially if it is for the sake of backwards compatibility with older
    kernels. Most kernel developers do care very much about backwards
    compatibility between userspace and older kernel versions. (Some of
    the folks who weren't as careful have been flamed pretty heavily about
    that; I'm still not a great fan of the sysfs interface from a
    userspace usability perspective, but we haven't had a major breakage
    there in quite a while.)

    So there is a big difference between "please do X, Y, and Z first and
    the patch would then be acceptable" versus "for reasons A, B and C
    this patch is totally unacceptable and in fact what you are trying to
    do is pointless".

    Regards,

    - Ted
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    On Wed, 2008-07-23 at 12:25 -0400, Theodore Tso wrote:
    > On Wed, Jul 23, 2008 at 12:16:12AM -0400, Frank Ch. Eigler wrote:
    > > I also proposed a compromise where systemtap would use the
    > > symbol+offset interface, but choose a single convenient symbol as base
    > > for all probes in a particular elf file (/section).

    >
    > I guess I don't see the value of that over just using the address
    > directly. James' point wasn't just to use the symbol+offset feature
    > just for the sake of using it, but rather as a better way of
    > specifying how to insert a probe into a kernel. For example, it may
    > be that by allowing the kernel to have a bit more semantic knowledge
    > of where a probe is going, it could more easily do various
    > safety-related checks that can't be done if all it is given is a a raw
    > address.
    >
    > > > probe module("ext4dev").function("ext4_fill_super")
    > > > {
    > > > printk("here am I!\n");
    > > > }

    > >
    > > > This should be done via passing a hard-coded address, or via using
    > > > the kprobes function+offset facility? It would seem that there are
    > > > advantages to James' patch all by itself, in that it will will work
    > > > even if the debuginfo information for the ext4dev module can't be
    > > > found, since the kallsyms information would be used instead.

    > >
    > > As a quality-of-implementation matter, systemtap checks at translation
    > > time that such probes make sense -- that "ext4_fill_super" even
    > > exists. (That is needed also to expand wildcards.) The only way it
    > > can do that is if it has dwarf or separate textual symbol table data
    > > (see above). Both of those carry addresses as well, so we might as
    > > well use them.

    >
    > True, though I'll note for modern kernels, with /proc/kallsyms, we
    > should hopefully be able to do this (offset=0 probes) without DWARF
    > headers. BTW, one of the things which I have wondered is whether
    > DWARF was really the right approach after all, given how bloated and
    > space-inefficient it seems to be. Needing to keep 600 megs of module
    > debuginfo for each kernel that I build (and yes, I build a fairly
    > complete module-rich kernel, but having 59 megs of modules expand into
    > 606 megs of debuginfo files is more than a little bit obnoxious.)


    Yes ... the Sun solution to this was something called CTF which is
    basically compressed dwarf with duplicates eliminated (as you say
    below). We could go a long way with simple duplicate elimination in
    dwarf. For instance dwarf tends to deposit a definition of a structure
    wherever it's used. For popular structures (like struct task) this can
    lead to hundreds of duplicates in the debug info.

    However, as has been rightly pointed out, most of what people want is
    simple entry and exit notification on functions ... this we can do with
    kprobes without any dwarf at all (so it's something systemtap should be
    able to do). In theory, we can also get at the actual function call
    arguments without resorting to dwarf, although that requires someone
    actually knowing the order and type (without dwarf we can't cross check
    unless we do something like parse the kernel headers).

    Where it gets tricky is if you want access to variables in the middle of
    the routine. This is very useful to kernel developers and for static
    tracepoints. Note that the current markers project has a solution to
    this which could be done dwarfless ... the problem is that it has to
    perturb the optimiser at that point to spill the variables of interest
    to locations we know, so the price we pay is zero impact when disabled.
    The approach I was taking more or less relies on dwarf (or CTF) to be
    zero impact because the variable could be aynwhere ... including in a
    register or temporary stack location.

    > Maybe the kernel could be one of the places where we experiment with
    > using something like CTF. Apparently CTF is compact enough that the
    > Solaris folks are willing to keep the CTF information for the
    > currently booted kernel in unswappable kernel memory --- which I know
    > is one of those things we wouldn't want to do with DWARF information!


    I'm not sure we'd want to do it with CTF either ... the only reason to
    have it in the kernel is if the kernel wants to use it. Right at the
    moment with the dwarf, systemtap userspace is the only consumer, so it
    doesn't need to be in the kernel; I don't really see that changing with
    CTF. I think dtrace does it because the D runtime is in-kernel so some
    of the run time typing needs the CTF.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [RFC] fix kallsyms to allow discrimination of local symbols

    On Wed, Jul 23, 2008 at 12:40:26PM -0400, James Bottomley wrote:
    > I'm not sure we'd want to do it with CTF either ... the only reason to
    > have it in the kernel is if the kernel wants to use it. Right at the
    > moment with the dwarf, systemtap userspace is the only consumer, so it
    > doesn't need to be in the kernel; I don't really see that changing with
    > CTF. I think dtrace does it because the D runtime is in-kernel so some
    > of the run time typing needs the CTF.


    As far as I understand things, all of that is done in userspace and
    byte-compiled down to run in their simple DTrace VM. The reason for
    CTF being in the Open Slaris is for their kernel debugger (kmdb) and
    so that a crash dump can be guaranteed to be easily analyzed.[1]

    - Ted

    [1] http://blogs.sun.com/levon/entry/reducing_ctf_overhead

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/