[RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) - Kernel

This is a discussion on [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) - Kernel ; One of the big nasties of systemtap is the way it tries to embed virtually the entirety of the kernel symbol table in the probe modules it constructs. This is highly undesirable because it represents a subversion of the kernel ...

+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 20 of 40

Thread: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

  1. [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    One of the big nasties of systemtap is the way it tries to embed
    virtually the entirety of the kernel symbol table in the probe modules
    it constructs. This is highly undesirable because it represents a
    subversion of the kernel API to gain access to unexported symbols. At
    least for kprobes, the correct way to do this is to specify the probe
    point by symbol and offset.

    This patch converts systemtap to use the correct kprobe
    symbol_name/offset pair to identify the probe location.

    This only represents a baby step: after this is done, there are at
    least three other consumers of the systemtap module relocation
    machinery:

    1. unwind information. I think the consumers of this can be
    converted to use the arch specific unwinders that already exist
    within the kernel
    2. systemtap specific functions that use kernel internals. This
    was things like get_cycles() but I think they all now use a
    sanctioned API ... need to check
    3. Access to unexported global variables used by the probes. This
    one is a bit tricky; the dwarf gives a probe the ability to
    access any variable available from the probed stack frame,
    including all globals. We could just make the globals off
    limits, but that weakens the value of the debugger.
    Alternatively, we could expand the kprobe API to allow probes
    access to named global variables (tricky to get right without
    effectively giving general symbol access). Thoughts?

    If you're going to try this out, you currently need to specify --kelf on
    the command line to tell systemtap to use the kernel elf to derive
    symbol names and offsets (it will just segfault without this ATM).

    James

    ---

    diff --git a/tapsets.cxx b/tapsets.cxx
    index 9037e15..a6a6dd3 100644
    --- a/tapsets.cxx
    +++ b/tapsets.cxx
    @@ -2306,13 +2306,15 @@ struct dwarf_derived_probe: public derived_probe
    const string& module,
    const string& section,
    Dwarf_Addr dwfl_addr,
    - Dwarf_Addr addr,
    + string symbol,
    + unsigned int offset,
    dwarf_query & q,
    Dwarf_Die* scope_die);

    string module;
    string section;
    - Dwarf_Addr addr;
    + string kprobe_symbol;
    + unsigned int kprobe_offset;
    bool has_return;
    bool has_maxactive;
    long maxactive_val;
    @@ -3260,9 +3262,18 @@ dwarf_query::add_probe_point(const string& funcname,

    if (! bad)
    {
    + struct module_info *mi = dw.mod_info;
    + if (!mi->sym_table)
    + mi->get_symtab(this);
    + struct symbol_table *sym_tab = mi->sym_table;
    + func_info *symbol = sym_tab->get_func_containing_address(addr);
    +
    sess.unwindsym_modules.insert (module);
    probe = new dwarf_derived_probe(funcname, filename, line,
    - module, reloc_section, addr, reloc_addr, *this, scope_die);
    + module, reloc_section, reloc_addr,
    + symbol->name,
    + (unsigned int)(addr - symbol->addr),
    + *this, scope_die);
    results.push_back(probe);
    }
    }
    @@ -4380,7 +4391,8 @@ dwarf_derived_probe:rintsig (ostream& o) const
    // function instances. This is distinct from the verbose/clog
    // output, since this part goes into the cache hash calculations.
    sole_location()->print (o);
    - o << " /* pc=0x" << hex << addr << dec << " */";
    + o << " /* pc=<" << kprobe_symbol << "+0x" << hex
    + << kprobe_offset << dec << "> */";
    printsig_nested (o);
    }

    @@ -4406,22 +4418,26 @@ dwarf_derived_probe::dwarf_derived_probe(const string& funcname,
    // NB: dwfl_addr is the virtualized
    // address for this symbol.
    Dwarf_Addr dwfl_addr,
    - // addr is the section-offset for
    - // actual relocation.
    - Dwarf_Addr addr,
    + // symbol is the closest known symbol
    + // and offset is the offset from the symbol
    + string symbol,
    + unsigned int offset,
    dwarf_query& q,
    Dwarf_Die* scope_die /* may be null */)
    : derived_probe (q.base_probe, new probe_point(*q.base_loc) /* .components soon rewritten */ ),
    - module (module), section (section), addr (addr),
    + module (module), section (section), kprobe_symbol(symbol),
    + kprobe_offset(offset),
    has_return (q.has_return),
    has_maxactive (q.has_maxactive),
    maxactive_val (q.maxactive_val)
    {
    // Assert relocation invariants
    +#if 0
    if (section == "" && dwfl_addr != addr) // addr should be absolute
    throw semantic_error ("missing relocation base against", q.base_loc->tok);
    if (section != "" && dwfl_addr == addr) // addr should be an offset
    throw semantic_error ("inconsistent relocation address", q.base_loc->tok);
    +#endif

    this->tok = q.base_probe->tok;

    @@ -4620,8 +4636,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    // Let's find some stats for the three embedded strings. Maybe they
    // are small and uniform enough to justify putting char[MAX]'s into
    // the array instead of relocated char*'s.
    - size_t module_name_max = 0, section_name_max = 0, pp_name_max = 0;
    - size_t module_name_tot = 0, section_name_tot = 0, pp_name_tot = 0;
    + size_t pp_name_max = 0, pp_name_tot = 0;
    + size_t symbol_name_name_max = 0, symbol_name_name_tot = 0;
    size_t all_name_cnt = probes_by_module.size(); // for average
    for (p_b_m_iterator it = probes_by_module.begin(); it != probes_by_module.end(); it++)
    {
    @@ -4630,9 +4646,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    size_t var##_size = (expr) + 1; \
    var##_max = max (var##_max, var##_size); \
    var##_tot += var##_size; } while (0)
    - DOIT(module_name, p->module.size());
    - DOIT(section_name, p->section.size());
    DOIT(pp_name, lex_cast_qstring(*p->sole_location()).size());
    + DOIT(symbol_name_name, p->kprobe_symbol.size());
    #undef DOIT
    }

    @@ -4652,11 +4667,10 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    if (s.verbose > 2) clog << "stap_dwarf_probe *" << #var << endl; \
    }

    - CALCIT(module);
    - CALCIT(section);
    CALCIT(pp);
    + CALCIT(symbol_name);

    - s.op->newline() << "const unsigned long address;";
    + s.op->newline() << "unsigned int offset;";
    s.op->newline() << "void (* const ph) (struct context*);";
    s.op->newline(-1) << "} stap_dwarf_probes[] = {";
    s.op->indent(1);
    @@ -4673,9 +4687,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    assert (p->maxactive_val >= 0 && p->maxactive_val <= USHRT_MAX);
    s.op->line() << " .maxactive_val=" << p->maxactive_val << ",";
    }
    - s.op->line() << " .address=0x" << hex << p->addr << dec << "UL,";
    - s.op->line() << " .module=\"" << p->module << "\",";
    - s.op->line() << " .section=\"" << p->section << "\",";
    + s.op->line() << " .symbol_name=\"" << p->kprobe_symbol << "\",";
    + s.op->line() << " .offset=0x" << hex << p->kprobe_offset << dec << ",";
    s.op->line() << " .pp=" << lex_cast_qstring (*p->sole_location()) << ",";
    s.op->line() << " .ph=&" << p->name;
    s.op->line() << " },";
    @@ -4735,11 +4748,10 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
    s.op->newline() << "for (i=0; i<" << probes_by_module.size() << "; i++) {";
    s.op->newline(1) << "struct stap_dwarf_probe *sdp = & stap_dwarf_probes[i];";
    s.op->newline() << "struct stap_dwarf_kprobe *kp = & stap_dwarf_kprobes[i];";
    - s.op->newline() << "unsigned long relocated_addr = _stp_module_relocate (sdp->module, sdp->section, sdp->address);";
    - s.op->newline() << "if (relocated_addr == 0) continue;"; // quietly; assume module is absent
    s.op->newline() << "probe_point = sdp->pp;";
    s.op->newline() << "if (sdp->return_p) {";
    - s.op->newline(1) << "kp->u.krp.kp.addr = (void *) relocated_addr;";
    + s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
    + s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
    s.op->newline() << "if (sdp->maxactive_p) {";
    s.op->newline(1) << "kp->u.krp.maxactive = sdp->maxactive_val;";
    s.op->newline(-1) << "} else {";
    @@ -4748,7 +4760,8 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
    s.op->newline() << "kp->u.krp.handler = &enter_kretprobe_probe;";
    s.op->newline() << "rc = register_kretprobe (& kp->u.krp);";
    s.op->newline(-1) << "} else {";
    - s.op->newline(1) << "kp->u.kp.addr = (void *) relocated_addr;";
    + s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
    + s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
    s.op->newline() << "kp->u.kp.pre_handler = &enter_kprobe_probe;";
    s.op->newline() << "rc = register_kprobe (& kp->u.kp);";
    s.op->newline(-1) << "}";
    @@ -4885,12 +4898,20 @@ dwarf_builder::build(systemtap_session & sess,
    throw semantic_error ("absolute statement probe in unprivileged script", q.base_probe->tok);
    }

    + struct module_info *mi = dw->mod_info;
    + if (!mi->sym_table)
    + mi->get_symtab(&q);
    + struct symbol_table *sym_tab = mi->sym_table;
    + func_info *symbol = sym_tab->get_func_containing_address(q.statement_num_val);
    +
    // For kernel.statement(NUM).absolute probe points, we bypass
    // all the debuginfo stuff: We just wire up a
    // dwarf_derived_probe right here and now.
    dwarf_derived_probe* p =
    new dwarf_derived_probe ("", "", 0, "kernel", "",
    - q.statement_num_val, q.statement_num_val,
    + q.statement_num_val,
    + symbol->name,
    + (unsigned int)(q.statement_num_val - symbol->addr),
    q, 0);
    finished_results.push_back (p);
    sess.unwindsym_modules.insert ("kernel");


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    James Bottomley wrote:
    > One of the big nasties of systemtap is the way it tries to embed
    > virtually the entirety of the kernel symbol table in the probe modules
    > it constructs. This is highly undesirable because it represents a
    > subversion of the kernel API to gain access to unexported symbols. At
    > least for kprobes, the correct way to do this is to specify the probe
    > point by symbol and offset.
    >
    > This patch converts systemtap to use the correct kprobe
    > symbol_name/offset pair to identify the probe location.


    Hi James,

    I think your suggestion is a good step. Of course, it might
    have to solve some issues.

    Unfortunately, current kprobe's symbol_name interface is not
    so clever. For example, if you specify a static function
    which is defined at several places in the kernel(ex. do_open),
    it always pick up the first one in kallsyms, even if systemtap
    can find all of those functions.
    (you can find many duplicated symbols in /proc/kallsyms)

    So, we might better improve kallsyms to treat this case
    and find what is a better way to specify symbols and addresses.

    >
    > This only represents a baby step: after this is done, there are at
    > least three other consumers of the systemtap module relocation
    > machinery:
    >
    > 1. unwind information. I think the consumers of this can be
    > converted to use the arch specific unwinders that already exist
    > within the kernel
    > 2. systemtap specific functions that use kernel internals. This
    > was things like get_cycles() but I think they all now use a
    > sanctioned API ... need to check


    Sure, those functions must be well isolated from other parts of kernel.
    unfortunately, relayfs is not enough isolated. see below;
    http://sources.redhat.com/bugzilla/show_bug.cgi?id=6487

    > 3. Access to unexported global variables used by the probes. This
    > one is a bit tricky; the dwarf gives a probe the ability to
    > access any variable available from the probed stack frame,
    > including all globals. We could just make the globals off
    > limits, but that weakens the value of the debugger.
    > Alternatively, we could expand the kprobe API to allow probes
    > access to named global variables (tricky to get right without
    > effectively giving general symbol access). Thoughts?


    Could we provide a separated GPL'd interface to access named global
    symbols which is based on kallsyms?

    Thank you,

    > If you're going to try this out, you currently need to specify --kelf on
    > the command line to tell systemtap to use the kernel elf to derive
    > symbol names and offsets (it will just segfault without this ATM).
    >
    > James


    --
    Masami Hiramatsu

    Software Engineer
    Hitachi Computer Products (America) Inc.
    Software Solutions Division

    e-mail: mhiramat@redhat.com

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Wed, 2008-07-16 at 18:40 -0400, Masami Hiramatsu wrote:
    > James Bottomley wrote:
    > > One of the big nasties of systemtap is the way it tries to embed
    > > virtually the entirety of the kernel symbol table in the probe modules
    > > it constructs. This is highly undesirable because it represents a
    > > subversion of the kernel API to gain access to unexported symbols. At
    > > least for kprobes, the correct way to do this is to specify the probe
    > > point by symbol and offset.
    > >
    > > This patch converts systemtap to use the correct kprobe
    > > symbol_name/offset pair to identify the probe location.

    >
    > Hi James,
    >
    > I think your suggestion is a good step. Of course, it might
    > have to solve some issues.
    >
    > Unfortunately, current kprobe's symbol_name interface is not
    > so clever. For example, if you specify a static function
    > which is defined at several places in the kernel(ex. do_open),
    > it always pick up the first one in kallsyms, even if systemtap
    > can find all of those functions.
    > (you can find many duplicated symbols in /proc/kallsyms)


    Right, but realistically only functions which have a strict existence
    (i.e. those for whom an address could be taken) can be used; functions
    which are fully inlined (as in have no separate existence) can't.
    That's why the patch finds the closest function with an address to match
    on.

    > So, we might better improve kallsyms to treat this case
    > and find what is a better way to specify symbols and addresses.


    Well, both the dwarf and the kallsyms know which are the functions that
    have a real existence, so the tool can work it out. It has a real
    meaning too because the chosen symbol must be the parent routine of all
    the nested inlines.

    > > This only represents a baby step: after this is done, there are at
    > > least three other consumers of the systemtap module relocation
    > > machinery:
    > >
    > > 1. unwind information. I think the consumers of this can be
    > > converted to use the arch specific unwinders that already exist
    > > within the kernel
    > > 2. systemtap specific functions that use kernel internals. This
    > > was things like get_cycles() but I think they all now use a
    > > sanctioned API ... need to check

    >
    > Sure, those functions must be well isolated from other parts of kernel.
    > unfortunately, relayfs is not enough isolated. see below;
    > http://sources.redhat.com/bugzilla/show_bug.cgi?id=6487


    This is just "who guards the guards" or in this case, you can't probe
    pieces of the kernel that the probe internals use. However, as long as
    the separation is tight this shouldn't be too much of a problem.

    > > 3. Access to unexported global variables used by the probes. This
    > > one is a bit tricky; the dwarf gives a probe the ability to
    > > access any variable available from the probed stack frame,
    > > including all globals. We could just make the globals off
    > > limits, but that weakens the value of the debugger.
    > > Alternatively, we could expand the kprobe API to allow probes
    > > access to named global variables (tricky to get right without
    > > effectively giving general symbol access). Thoughts?

    >
    > Could we provide a separated GPL'd interface to access named global
    > symbols which is based on kallsyms?


    Yes, I think so ... it's just a case of working out what and how; but to
    do that we need a consumer of the interface.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    James Bottomley wrote:
    > On Wed, 2008-07-16 at 18:40 -0400, Masami Hiramatsu wrote:
    >> James Bottomley wrote:
    >>> One of the big nasties of systemtap is the way it tries to embed
    >>> virtually the entirety of the kernel symbol table in the probe modules
    >>> it constructs. This is highly undesirable because it represents a
    >>> subversion of the kernel API to gain access to unexported symbols. At
    >>> least for kprobes, the correct way to do this is to specify the probe
    >>> point by symbol and offset.
    >>>
    >>> This patch converts systemtap to use the correct kprobe
    >>> symbol_name/offset pair to identify the probe location.

    >> Hi James,
    >>
    >> I think your suggestion is a good step. Of course, it might
    >> have to solve some issues.
    >>
    >> Unfortunately, current kprobe's symbol_name interface is not
    >> so clever. For example, if you specify a static function
    >> which is defined at several places in the kernel(ex. do_open),
    >> it always pick up the first one in kallsyms, even if systemtap
    >> can find all of those functions.
    >> (you can find many duplicated symbols in /proc/kallsyms)

    >
    > Right, but realistically only functions which have a strict existence
    > (i.e. those for whom an address could be taken) can be used; functions
    > which are fully inlined (as in have no separate existence) can't.
    > That's why the patch finds the closest function with an address to match
    > on.


    Sure, inlined functions are embedded in a caller function, so the
    closest function is the correct owner.

    However, I meant local-scope functions can have same name if
    they are defined in different scope. And even though, both of
    them are shown in kallsyms. This mean, you can see the functions
    which have real different existence but have same symbol.

    Would you mean systemtap should not probe those name-conflicted
    functions?

    >> So, we might better improve kallsyms to treat this case
    >> and find what is a better way to specify symbols and addresses.

    >
    > Well, both the dwarf and the kallsyms know which are the functions that
    > have a real existence, so the tool can work it out. It has a real
    > meaning too because the chosen symbol must be the parent routine of all
    > the nested inlines.


    Hmm, here is what I got with your patch;
    $ stap --kelf -e 'probe kernel.function("do_open"){}' -p2
    # probes
    kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc= */ /* <- kernel.function("do_open") */
    kernel.function("do_open@fs/block_dev.c:928") /* pc= */ /* <- kernel.function("do_open") */
    kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    kernel.function("do_open@ipc/mqueue.c:630") /* pc= */ /* <- kernel.function("do_open") */

    Without your patch;
    $ stap -e 'probe kernel.function("do_open"){}' -p2
    # probes
    kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=0x10382 */ /* <- kernel.function("do_open") */
    kernel.function("do_open@fs/block_dev.c:928") /* pc=0xa0750 */ /* <- kernel.function("do_open") */
    kernel.function("do_open@fs/nfsctl.c:24") /* pc=0xa6411 */ /* <- kernel.function("do_open") */
    kernel.function("do_open@ipc/mqueue.c:630") /* pc=0xc55a6 */ /* <- kernel.function("do_open") */

    Obviously, the 3rd "do_open" is fully inlined, so it can be
    correctly handled by kprobes, because it has different
    symbol(sys_nfsservctl). However, other "do_open" have
    same symbol(do_open). these will be put on same
    address (at the first symbol on kallsyms list).

    So, we need a bridge for the gap of function addresses
    between kallsyms and dwarf.

    [...]
    >> Could we provide a separated GPL'd interface to access named global
    >> symbols which is based on kallsyms?

    >
    > Yes, I think so ... it's just a case of working out what and how; but to
    > do that we need a consumer of the interface.


    I agree with you, we need to change both of systemtap and kernel.

    Thank you,

    >
    > James
    >
    >


    --
    Masami Hiramatsu

    Software Engineer
    Hitachi Computer Products (America) Inc.
    Software Solutions Division

    e-mail: mhiramat@redhat.com

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Wed, 2008-07-16 at 20:05 -0400, Masami Hiramatsu wrote:
    > James Bottomley wrote:
    > > On Wed, 2008-07-16 at 18:40 -0400, Masami Hiramatsu wrote:
    > >> James Bottomley wrote:
    > >>> One of the big nasties of systemtap is the way it tries to embed
    > >>> virtually the entirety of the kernel symbol table in the probe modules
    > >>> it constructs. This is highly undesirable because it represents a
    > >>> subversion of the kernel API to gain access to unexported symbols. At
    > >>> least for kprobes, the correct way to do this is to specify the probe
    > >>> point by symbol and offset.
    > >>>
    > >>> This patch converts systemtap to use the correct kprobe
    > >>> symbol_name/offset pair to identify the probe location.
    > >> Hi James,
    > >>
    > >> I think your suggestion is a good step. Of course, it might
    > >> have to solve some issues.
    > >>
    > >> Unfortunately, current kprobe's symbol_name interface is not
    > >> so clever. For example, if you specify a static function
    > >> which is defined at several places in the kernel(ex. do_open),
    > >> it always pick up the first one in kallsyms, even if systemtap
    > >> can find all of those functions.
    > >> (you can find many duplicated symbols in /proc/kallsyms)

    > >
    > > Right, but realistically only functions which have a strict existence
    > > (i.e. those for whom an address could be taken) can be used; functions
    > > which are fully inlined (as in have no separate existence) can't.
    > > That's why the patch finds the closest function with an address to match
    > > on.

    >
    > Sure, inlined functions are embedded in a caller function, so the
    > closest function is the correct owner.
    >
    > However, I meant local-scope functions can have same name if
    > they are defined in different scope. And even though, both of
    > them are shown in kallsyms. This mean, you can see the functions
    > which have real different existence but have same symbol.
    >
    > Would you mean systemtap should not probe those name-conflicted
    > functions?


    Actually, I wasn't aware we had any.

    > >> So, we might better improve kallsyms to treat this case
    > >> and find what is a better way to specify symbols and addresses.

    > >
    > > Well, both the dwarf and the kallsyms know which are the functions that
    > > have a real existence, so the tool can work it out. It has a real
    > > meaning too because the chosen symbol must be the parent routine of all
    > > the nested inlines.

    >
    > Hmm, here is what I got with your patch;
    > $ stap --kelf -e 'probe kernel.function("do_open"){}' -p2
    > # probes
    > kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@fs/block_dev.c:928") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@ipc/mqueue.c:630") /* pc= */ /* <- kernel.function("do_open") */
    >
    > Without your patch;
    > $ stap -e 'probe kernel.function("do_open"){}' -p2
    > # probes
    > kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=0x10382 */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@fs/block_dev.c:928") /* pc=0xa0750 */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@fs/nfsctl.c:24") /* pc=0xa6411 */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@ipc/mqueue.c:630") /* pc=0xc55a6 */ /* <- kernel.function("do_open") */
    >
    > Obviously, the 3rd "do_open" is fully inlined, so it can be
    > correctly handled by kprobes, because it has different
    > symbol(sys_nfsservctl). However, other "do_open" have
    > same symbol(do_open). these will be put on same
    > address (at the first symbol on kallsyms list).
    >
    > So, we need a bridge for the gap of function addresses
    > between kallsyms and dwarf.


    You mean this particular problem:

    hobholes:/home/jejb/git/BUILD-2.6# grep do_open /proc/kallsyms
    c01af160 t do_open
    c01d5d40 t do_open

    It's certainly a material defect in the current API. I'll think about
    it and see if I can come up with a solution.

    > [...]
    > >> Could we provide a separated GPL'd interface to access named global
    > >> symbols which is based on kallsyms?

    > >
    > > Yes, I think so ... it's just a case of working out what and how; but to
    > > do that we need a consumer of the interface.

    >
    > I agree with you, we need to change both of systemtap and kernel.
    >
    > Thank you,


    You're welcome.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Wed, 2008-07-16 at 20:49 -0500, James Bottomley wrote:
    > On Wed, 2008-07-16 at 20:05 -0400, Masami Hiramatsu wrote:
    > > Hmm, here is what I got with your patch;
    > > $ stap --kelf -e 'probe kernel.function("do_open"){}' -p2
    > > # probes
    > > kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@fs/block_dev.c:928") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@ipc/mqueue.c:630") /* pc= */ /* <- kernel.function("do_open") */
    > >
    > > Without your patch;
    > > $ stap -e 'probe kernel.function("do_open"){}' -p2
    > > # probes
    > > kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=0x10382 */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@fs/block_dev.c:928") /* pc=0xa0750 */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@fs/nfsctl.c:24") /* pc=0xa6411 */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@ipc/mqueue.c:630") /* pc=0xc55a6 */ /* <- kernel.function("do_open") */
    > >
    > > Obviously, the 3rd "do_open" is fully inlined, so it can be
    > > correctly handled by kprobes, because it has different
    > > symbol(sys_nfsservctl). However, other "do_open" have
    > > same symbol(do_open). these will be put on same
    > > address (at the first symbol on kallsyms list).
    > >
    > > So, we need a bridge for the gap of function addresses
    > > between kallsyms and dwarf.

    >
    > You mean this particular problem:
    >
    > hobholes:/home/jejb/git/BUILD-2.6# grep do_open /proc/kallsyms
    > c01af160 t do_open
    > c01d5d40 t do_open
    >
    > It's certainly a material defect in the current API. I'll think about
    > it and see if I can come up with a solution.


    OK, thought about it. There seem to be two possible solutions

    1. Get systemtap always to offset from non-static functions. This
    will use the standard linker to ensure uniqueness (a module
    qualifier will still need to be added to the struct kprobe for
    lookup, since modules can duplicate unexported kernel symbols).
    2. Add the filename as a discriminator for duplicate symbols in the
    kallsyms program (would still need module qualifier). This is
    appealing because the path name would be printed in the kernel
    trace to help with oops tracking

    This is where negotiations come in. To me 2. looks to be better because
    it will help us with oops tracking. On the other hand, it's usually
    pretty obvious from the stack trace context which files the duplicate
    symbols are actually in; what do other people think?

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
    > OK, thought about it. There seem to be two possible solutions
    >
    > 1. Get systemtap always to offset from non-static functions. This
    > will use the standard linker to ensure uniqueness (a module
    > qualifier will still need to be added to the struct kprobe for
    > lookup, since modules can duplicate unexported kernel symbols).
    > 2. Add the filename as a discriminator for duplicate symbols in the
    > kallsyms program (would still need module qualifier). This is
    > appealing because the path name would be printed in the kernel
    > trace to help with oops tracking
    >
    > This is where negotiations come in. To me 2. looks to be better because
    > it will help us with oops tracking. On the other hand, it's usually
    > pretty obvious from the stack trace context which files the duplicate
    > symbols are actually in; what do other people think?


    Just by way of illustration, this is systemtap fixed up according to
    suggestion number 1. You can see now using your test case that we get:

    # probes
    kernel.function("do_open@fs/block_dev.c:929") /* pc= */ /* <- kernel.function("do_open") */
    kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    kernel.function("do_open@ipc/mqueue.c:642") /* pc= */ /* <- kernel.function("do_open") */

    James

    ---

    >From 0203b75a9ca97fc7463e6372baee897d1029b799 Mon Sep 17 00:00:00 2001

    From: James Bottomley
    Date: Tue, 15 Jul 2008 13:25:00 -0500
    Subject: Part1 use symbol_name/offset to locate dwarf probes

    ---
    tapsets.cxx | 119 +++++++++++++++++++++++++++++++++++++++++------------------
    1 files changed, 83 insertions(+), 36 deletions(-)

    diff --git a/tapsets.cxx b/tapsets.cxx
    index 4d1e469..da198c9 100644
    --- a/tapsets.cxx
    +++ b/tapsets.cxx
    @@ -491,6 +491,7 @@ func_info
    Dwarf_Addr entrypc;
    Dwarf_Addr prologue_end;
    bool weak;
    + bool global;
    // Comparison functor for list of functions sorted by address. The
    // two versions that take a Dwarf_Addr let us use the STL algorithms
    // upper_bound, equal_range et al., but we don't know whether the
    @@ -591,6 +592,7 @@ symbol_table
    module_info *mod_info; // associated module
    map map_by_name;
    vector list_by_addr;
    + vector global_list_by_addr;
    typedef vector::iterator iterator_t;
    typedef pair range_t;
    #ifdef __powerpc__
    @@ -598,7 +600,7 @@ symbol_table
    #endif
    // add_symbol doesn't leave symbol table in order; call
    // symbol_table::sort() when done adding symbols.
    - void add_symbol(const char *name, bool weak, Dwarf_Addr addr,
    + void add_symbol(const char *name, bool weak, bool global, Dwarf_Addr addr,
    Dwarf_Addr *high_addr);
    void sort();
    enum info_status read_symbols(FILE *f, const string& path);
    @@ -611,7 +613,7 @@ symbol_table
    void purge_syscall_stubs();
    func_info *lookup_symbol(const string& name);
    Dwarf_Addr lookup_symbol_address(const string& name);
    - func_info *get_func_containing_address(Dwarf_Addr addr);
    + func_info *get_func_containing_address(Dwarf_Addr addr, bool global);

    symbol_table(module_info *mi) : mod_info(mi) {}
    ~symbol_table();
    @@ -2306,13 +2308,15 @@ struct dwarf_derived_probe: public derived_probe
    const string& module,
    const string& section,
    Dwarf_Addr dwfl_addr,
    - Dwarf_Addr addr,
    + string symbol,
    + unsigned int offset,
    dwarf_query & q,
    Dwarf_Die* scope_die);

    string module;
    string section;
    - Dwarf_Addr addr;
    + string kprobe_symbol;
    + unsigned int kprobe_offset;
    bool has_return;
    bool has_maxactive;
    long maxactive_val;
    @@ -2835,7 +2839,7 @@ dwarf_query::query_module_symtab()
    // Find the "function" in which the indicated address resides.
    Dwarf_Addr addr =
    (has_function_num ? function_num_val : statement_num_val);
    - fi = sym_table->get_func_containing_address(addr);
    + fi = sym_table->get_func_containing_address(addr, false);
    if (!fi)
    {
    cerr << "Warning: address "
    @@ -3260,9 +3264,18 @@ dwarf_query::add_probe_point(const string& funcname,

    if (! bad)
    {
    + struct module_info *mi = dw.mod_info;
    + if (!mi->sym_table)
    + mi->get_symtab(this);
    + struct symbol_table *sym_tab = mi->sym_table;
    + func_info *symbol = sym_tab->get_func_containing_address(addr, true);
    +
    sess.unwindsym_modules.insert (module);
    probe = new dwarf_derived_probe(funcname, filename, line,
    - module, reloc_section, addr, reloc_addr, *this, scope_die);
    + module, reloc_section, reloc_addr,
    + symbol->name,
    + (unsigned int)(addr - symbol->addr),
    + *this, scope_die);
    results.push_back(probe);
    }
    }
    @@ -4380,7 +4393,8 @@ dwarf_derived_probe:rintsig (ostream& o) const
    // function instances. This is distinct from the verbose/clog
    // output, since this part goes into the cache hash calculations.
    sole_location()->print (o);
    - o << " /* pc=0x" << hex << addr << dec << " */";
    + o << " /* pc=<" << kprobe_symbol << "+0x" << hex
    + << kprobe_offset << dec << "> */";
    printsig_nested (o);
    }

    @@ -4406,22 +4420,26 @@ dwarf_derived_probe::dwarf_derived_probe(const string& funcname,
    // NB: dwfl_addr is the virtualized
    // address for this symbol.
    Dwarf_Addr dwfl_addr,
    - // addr is the section-offset for
    - // actual relocation.
    - Dwarf_Addr addr,
    + // symbol is the closest known symbol
    + // and offset is the offset from the symbol
    + string symbol,
    + unsigned int offset,
    dwarf_query& q,
    Dwarf_Die* scope_die /* may be null */)
    : derived_probe (q.base_probe, new probe_point(*q.base_loc) /* .components soon rewritten */ ),
    - module (module), section (section), addr (addr),
    + module (module), section (section), kprobe_symbol(symbol),
    + kprobe_offset(offset),
    has_return (q.has_return),
    has_maxactive (q.has_maxactive),
    maxactive_val (q.maxactive_val)
    {
    // Assert relocation invariants
    +#if 0
    if (section == "" && dwfl_addr != addr) // addr should be absolute
    throw semantic_error ("missing relocation base against", q.base_loc->tok);
    if (section != "" && dwfl_addr == addr) // addr should be an offset
    throw semantic_error ("inconsistent relocation address", q.base_loc->tok);
    +#endif

    this->tok = q.base_probe->tok;

    @@ -4620,8 +4638,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    // Let's find some stats for the three embedded strings. Maybe they
    // are small and uniform enough to justify putting char[MAX]'s into
    // the array instead of relocated char*'s.
    - size_t module_name_max = 0, section_name_max = 0, pp_name_max = 0;
    - size_t module_name_tot = 0, section_name_tot = 0, pp_name_tot = 0;
    + size_t pp_name_max = 0, pp_name_tot = 0;
    + size_t symbol_name_name_max = 0, symbol_name_name_tot = 0;
    size_t all_name_cnt = probes_by_module.size(); // for average
    for (p_b_m_iterator it = probes_by_module.begin(); it != probes_by_module.end(); it++)
    {
    @@ -4630,9 +4648,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    size_t var##_size = (expr) + 1; \
    var##_max = max (var##_max, var##_size); \
    var##_tot += var##_size; } while (0)
    - DOIT(module_name, p->module.size());
    - DOIT(section_name, p->section.size());
    DOIT(pp_name, lex_cast_qstring(*p->sole_location()).size());
    + DOIT(symbol_name_name, p->kprobe_symbol.size());
    #undef DOIT
    }

    @@ -4652,11 +4669,10 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    if (s.verbose > 2) clog << "stap_dwarf_probe *" << #var << endl; \
    }

    - CALCIT(module);
    - CALCIT(section);
    CALCIT(pp);
    + CALCIT(symbol_name);

    - s.op->newline() << "const unsigned long address;";
    + s.op->newline() << "unsigned int offset;";
    s.op->newline() << "void (* const ph) (struct context*);";
    s.op->newline(-1) << "} stap_dwarf_probes[] = {";
    s.op->indent(1);
    @@ -4673,9 +4689,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
    assert (p->maxactive_val >= 0 && p->maxactive_val <= USHRT_MAX);
    s.op->line() << " .maxactive_val=" << p->maxactive_val << ",";
    }
    - s.op->line() << " .address=0x" << hex << p->addr << dec << "UL,";
    - s.op->line() << " .module=\"" << p->module << "\",";
    - s.op->line() << " .section=\"" << p->section << "\",";
    + s.op->line() << " .symbol_name=\"" << p->kprobe_symbol << "\",";
    + s.op->line() << " .offset=0x" << hex << p->kprobe_offset << dec << ",";
    s.op->line() << " .pp=" << lex_cast_qstring (*p->sole_location()) << ",";
    s.op->line() << " .ph=&" << p->name;
    s.op->line() << " },";
    @@ -4735,11 +4750,10 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
    s.op->newline() << "for (i=0; i<" << probes_by_module.size() << "; i++) {";
    s.op->newline(1) << "struct stap_dwarf_probe *sdp = & stap_dwarf_probes[i];";
    s.op->newline() << "struct stap_dwarf_kprobe *kp = & stap_dwarf_kprobes[i];";
    - s.op->newline() << "unsigned long relocated_addr = _stp_module_relocate (sdp->module, sdp->section, sdp->address);";
    - s.op->newline() << "if (relocated_addr == 0) continue;"; // quietly; assume module is absent
    s.op->newline() << "probe_point = sdp->pp;";
    s.op->newline() << "if (sdp->return_p) {";
    - s.op->newline(1) << "kp->u.krp.kp.addr = (void *) relocated_addr;";
    + s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
    + s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
    s.op->newline() << "if (sdp->maxactive_p) {";
    s.op->newline(1) << "kp->u.krp.maxactive = sdp->maxactive_val;";
    s.op->newline(-1) << "} else {";
    @@ -4748,7 +4762,8 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
    s.op->newline() << "kp->u.krp.handler = &enter_kretprobe_probe;";
    s.op->newline() << "rc = register_kretprobe (& kp->u.krp);";
    s.op->newline(-1) << "} else {";
    - s.op->newline(1) << "kp->u.kp.addr = (void *) relocated_addr;";
    + s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
    + s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
    s.op->newline() << "kp->u.kp.pre_handler = &enter_kprobe_probe;";
    s.op->newline() << "rc = register_kprobe (& kp->u.kp);";
    s.op->newline(-1) << "}";
    @@ -4885,12 +4900,20 @@ dwarf_builder::build(systemtap_session & sess,
    throw semantic_error ("absolute statement probe in unprivileged script", q.base_probe->tok);
    }

    + struct module_info *mi = dw->mod_info;
    + if (!mi->sym_table)
    + mi->get_symtab(&q);
    + struct symbol_table *sym_tab = mi->sym_table;
    + func_info *symbol = sym_tab->get_func_containing_address(q.statement_num_val, true);
    +
    // For kernel.statement(NUM).absolute probe points, we bypass
    // all the debuginfo stuff: We just wire up a
    // dwarf_derived_probe right here and now.
    dwarf_derived_probe* p =
    new dwarf_derived_probe ("", "", 0, "kernel", "",
    - q.statement_num_val, q.statement_num_val,
    + q.statement_num_val,
    + symbol->name,
    + (unsigned int)(q.statement_num_val - symbol->addr),
    q, 0);
    finished_results.push_back (p);
    sess.unwindsym_modules.insert ("kernel");
    @@ -4908,8 +4931,8 @@ symbol_table::~symbol_table()
    }

    void
    -symbol_table::add_symbol(const char *name, bool weak, Dwarf_Addr addr,
    - Dwarf_Addr *high_addr)
    +symbol_table::add_symbol(const char *name, bool weak, bool global,
    + Dwarf_Addr addr, Dwarf_Addr *high_addr)
    {
    #ifdef __powerpc__
    // Map ".sys_foo" to "sys_foo".
    @@ -4920,10 +4943,13 @@ symbol_table::add_symbol(const char *name, bool weak, Dwarf_Addr addr,
    fi->addr = addr;
    fi->name = name;
    fi->weak = weak;
    + fi->global = global;
    map_by_name[fi->name] = fi;
    // TODO: Use a multimap in case there are multiple static
    // functions with the same name?
    list_by_addr.push_back(fi);
    + if (global)
    + global_list_by_addr.push_back(fi);
    }

    enum info_status
    @@ -4961,7 +4987,8 @@ symbol_table::read_symbols(FILE *f, const string& path)
    break;
    }
    if (type == 'T' || type == 't' || type == 'W')
    - add_symbol(name, (type == 'W'), (Dwarf_Addr) addr, &high_addr);
    + add_symbol(name, (type == 'W'), (type == 'T'),
    + (Dwarf_Addr) addr, &high_addr);
    }

    if (list_by_addr.size() < 1)
    @@ -5080,7 +5107,8 @@ symbol_table::get_from_elf()
    if (name && GELF_ST_TYPE(sym.st_info) == STT_FUNC &&
    !reject_section(section))
    add_symbol(name, (GELF_ST_BIND(sym.st_info) == STB_WEAK),
    - sym.st_value, &high_addr);
    + GELF_ST_BIND(sym.st_info) == STB_GLOBAL,
    + sym.st_value, &high_addr);
    }
    sort();
    return info_present;
    @@ -5121,14 +5149,29 @@ symbol_table::mark_dwarf_redundancies(dwflpp *dw)
    }

    func_info *
    -symbol_table::get_func_containing_address(Dwarf_Ad dr addr)
    +symbol_table::get_func_containing_address(Dwarf_A ddr addr, bool global)
    {
    - iterator_t iter = upper_bound(list_by_addr.begin(), list_by_addr.end(), addr,
    - func_info::Compare());
    - if (iter == list_by_addr.begin())
    - return NULL;
    + iterator_t iter;
    +
    + if (global)
    + {
    + iter = upper_bound(global_list_by_addr.begin(),
    + global_list_by_addr.end(), addr,
    + func_info::Compare());
    + if (iter == global_list_by_addr.begin())
    + return NULL;
    + else
    + return *(iter - 1);
    + }
    else
    - return *(iter - 1);
    + {
    + iter = upper_bound(list_by_addr.begin(), list_by_addr.end(), addr,
    + func_info::Compare());
    + if (iter == list_by_addr.begin())
    + return NULL;
    + else
    + return *(iter - 1);
    + }
    }

    func_info *
    @@ -5181,12 +5224,16 @@ symbol_table:urge_syscall_stubs()
    list_by_addr.erase(remove(purge_range.first, purge_range.second,
    (func_info*)0),
    purge_range.second);
    + // NOTE: At the moment global_list_by_addr has no weak symbols
    + // so nothing needs to be removed from it.
    }

    void
    symbol_table::sort()
    {
    stable_sort(list_by_addr.begin(), list_by_addr.end(), func_info::Compare());
    + stable_sort(global_list_by_addr.begin(), global_list_by_addr.end(),
    + func_info::Compare());
    }

    void
    --
    1.5.6



    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    James Bottomley writes:

    > [...]
    > Just by way of illustration, this is systemtap fixed up according to
    > suggestion number 1. You can see now using your test case that we get:
    >
    > # probes
    > kernel.function("do_open@fs/block_dev.c:929") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@ipc/mqueue.c:642") /* pc= */ /* <- kernel.function("do_open") */
    > [...]


    Can you explain in detail how you believe this is materially
    different from offsetting from _stext?

    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Thu, 2008-07-17 at 14:30 -0400, Frank Ch. Eigler wrote:
    > James Bottomley writes:
    >
    > > [...]
    > > Just by way of illustration, this is systemtap fixed up according to
    > > suggestion number 1. You can see now using your test case that we get:
    > >
    > > # probes
    > > kernel.function("do_open@fs/block_dev.c:929") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@ipc/mqueue.c:642") /* pc= */ /* <- kernel.function("do_open") */
    > > [...]

    >
    > Can you explain in detail how you believe this is materially
    > different from offsetting from _stext?


    Basically because _stext is an incredibly dangerous symbol; being linker
    generated it doesn't actually get put in the right place if you look:

    jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
    ffffffff80209000 T _stext
    ffffffff80200000 A _text

    Since we can't do negative offsets, you've lost access to the symbols in
    the sections that start before _stext. Assuming you meant _text (which
    is dangerous because it's a define in the kernel linker script and could
    change). Then you can't offset into other sections, like init sections
    or modules.

    James


    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    Hi -

    On Thu, Jul 17, 2008 at 03:12:26PM -0500, James Bottomley wrote:
    > [...]
    > > Can you explain in detail how you believe this is materially
    > > different from offsetting from _stext?

    >
    > Basically because _stext is an incredibly dangerous symbol; being linker
    > generated it doesn't actually get put in the right place if you look:


    Thank you for your response.

    > jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
    > ffffffff80209000 T _stext
    > ffffffff80200000 A _text
    >
    > Since we can't do negative offsets


    Actually, "we" as in systemtap could do it just fine if that were
    desired. And really _stext is therefore an arbitrary choice - it
    could be any other reference.

    My point is that the proposed effort to identify a nearby function
    symbol to use as a base for each probe's symbol+offset calculation is
    wasted.


    > you've lost access to the symbols in the sections that start before _stext.


    What's between _text and _stext appears to consist of kernel boot-time
    functions that are unmapped the time anything like systemtap could
    run.


    > Assuming you meant _text (which is dangerous because it's a define
    > in the kernel linker script and could change).


    By "dangerous" do you only mean that it may require a one-liner
    catch-up patch in systemtap if the kernel linker scripts change?


    > Then you can't offset into other sections, like init sections or
    > modules.


    Kernel init sections are unprobeable by definition, so that doesn't
    matter. Modules are also irrelevant, since their addresses are
    relative to their relocation bases / sections, not to a kernel
    (vmlinux) symbol.


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Thu, 2008-07-17 at 16:26 -0400, Frank Ch. Eigler wrote:
    > Hi -
    >
    > On Thu, Jul 17, 2008 at 03:12:26PM -0500, James Bottomley wrote:
    > > [...]
    > > > Can you explain in detail how you believe this is materially
    > > > different from offsetting from _stext?

    > >
    > > Basically because _stext is an incredibly dangerous symbol; being linker
    > > generated it doesn't actually get put in the right place if you look:

    >
    > Thank you for your response.
    >
    > > jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
    > > ffffffff80209000 T _stext
    > > ffffffff80200000 A _text
    > >
    > > Since we can't do negative offsets

    >
    > Actually, "we" as in systemtap could do it just fine if that were
    > desired. And really _stext is therefore an arbitrary choice - it
    > could be any other reference.
    >
    > My point is that the proposed effort to identify a nearby function
    > symbol to use as a base for each probe's symbol+offset calculation is
    > wasted.


    It's not exactly wasted ... the calculations have to be done anyway for
    modules.

    > > you've lost access to the symbols in the sections that start before _stext.

    >
    > What's between _text and _stext appears to consist of kernel boot-time
    > functions that are unmapped the time anything like systemtap could
    > run.


    Well, no, they're the head code. It's actually used in CPU boot and
    tear down, one of the things it's useful to probe, I think.

    > > Assuming you meant _text (which is dangerous because it's a define
    > > in the kernel linker script and could change).

    >
    > By "dangerous" do you only mean that it may require a one-liner
    > catch-up patch in systemtap if the kernel linker scripts change?


    Dangerous as in it's not necessarily part of the kernel linker scripts.
    Some arches have it defined as a symbol, some have it as a linker script
    definition ... that's why it's location is strange.

    The point, really, is to remove some of the fragile dependencies between
    systemtap and the kernel.

    > > Then you can't offset into other sections, like init sections or
    > > modules.

    >
    > Kernel init sections are unprobeable by definition, so that doesn't
    > matter. Modules are also irrelevant, since their addresses are
    > relative to their relocation bases / sections, not to a kernel
    > (vmlinux) symbol.


    Then the definition needs altering. I can see that the industrial
    customers aren't interested but kernel developers are ... a lot of
    problems occur in the init sections.

    I think you'll find that systemtap will run quite happily from a shell
    in an initramfs before the init sections are discarded. Plus there's
    always module init sections which can appear at any time.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    Hi -


    On Thu, Jul 17, 2008 at 04:06:09PM -0500, James Bottomley wrote:

    > [...]
    > > My point is that the proposed effort to identify a nearby function
    > > symbol to use as a base for each probe's symbol+offset calculation is
    > > wasted.

    >
    > It's not exactly wasted ... the calculations have to be done anyway for
    > modules.


    Not really - we just anchor off a different (per-module) reference
    symbol or address. At the moment, we use the .text* section bases.


    > > > you've lost access to the symbols in the sections that start before _stext.

    > >
    > > What's between _text and _stext appears to consist of kernel boot-time
    > > functions that are unmapped the time anything like systemtap could
    > > run.

    >
    > Well, no, they're the head code. It's actually used in CPU boot and
    > tear down, one of the things it's useful to probe, I think.


    Fair enough - conceivably probing that stuff is useful, as is module
    initialization. We don't try to do it yet (and indeed kprobes blocks
    it all).

    In any case, the method of probe address calculation doesn't affect
    that issue. We can calculate .init* addresses relative to any
    convenient reference in exactly the same way as non-.init addresses.


    > > > Assuming you meant _text (which is dangerous because it's a define
    > > > in the kernel linker script and could change).

    > >
    > > By "dangerous" do you only mean that it may require a one-liner
    > > catch-up patch in systemtap if the kernel linker scripts change?

    >
    > Dangerous as in it's not necessarily part of the kernel linker scripts.
    > [...]
    > The point, really, is to remove some of the fragile dependencies between
    > systemtap and the kernel.


    Yes, that is generally desirable - each case is usually a question of
    cost/benefit. One significant requirement for us is to keep working
    with older kernels.


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    James Bottomley wrote:
    > On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
    >> OK, thought about it. There seem to be two possible solutions
    >>
    >> 1. Get systemtap always to offset from non-static functions. This
    >> will use the standard linker to ensure uniqueness (a module
    >> qualifier will still need to be added to the struct kprobe for
    >> lookup, since modules can duplicate unexported kernel symbols).
    >> 2. Add the filename as a discriminator for duplicate symbols in the
    >> kallsyms program (would still need module qualifier). This is
    >> appealing because the path name would be printed in the kernel
    >> trace to help with oops tracking
    >>
    >> This is where negotiations come in. To me 2. looks to be better because
    >> it will help us with oops tracking. On the other hand, it's usually
    >> pretty obvious from the stack trace context which files the duplicate
    >> symbols are actually in; what do other people think?

    >
    > Just by way of illustration, this is systemtap fixed up according to
    > suggestion number 1. You can see now using your test case that we get:
    >
    > # probes
    > kernel.function("do_open@fs/block_dev.c:929") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    > kernel.function("do_open@ipc/mqueue.c:642") /* pc= */ /* <- kernel.function("do_open") */


    Hi James,

    Thank you for updating the patch.
    Unfortunately, I found another scenario; if someone make a module which
    has EXPORT_SYMBOL(do_open), it's a non-static function. but there are
    other static version do_open in kallsyms.
    Here, I tested it and got below;

    $ stap --kelf -e 'probe module("test").function("do_open"){}' -p2
    # probes
    module("test").function("do_open@?") /* pc= */ /* <- module("test").function("do_open") */

    And I think similar issue will occur even if it is embedded in vmlinux.

    By the way, can this patch solve the issue of -ffunction-sections?

    Anyway, I think we still need solution no.2.


    Thank you,

    --
    Masami Hiramatsu

    Software Engineer
    Hitachi Computer Products (America) Inc.
    Software Solutions Division

    e-mail: mhiramat@redhat.com

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    Frank Ch. Eigler wrote:
    > Hi -
    >
    >
    > On Thu, Jul 17, 2008 at 04:06:09PM -0500, James Bottomley wrote:
    >
    >> [...]
    >>> My point is that the proposed effort to identify a nearby function
    >>> symbol to use as a base for each probe's symbol+offset calculation is
    >>> wasted.

    >> It's not exactly wasted ... the calculations have to be done anyway for
    >> modules.

    >
    > Not really - we just anchor off a different (per-module) reference
    > symbol or address. At the moment, we use the .text* section bases.
    >
    >
    >>>> you've lost access to the symbols in the sections that start before _stext.
    >>> What's between _text and _stext appears to consist of kernel boot-time
    >>> functions that are unmapped the time anything like systemtap could
    >>> run.

    >> Well, no, they're the head code. It's actually used in CPU boot and
    >> tear down, one of the things it's useful to probe, I think.

    >
    > Fair enough - conceivably probing that stuff is useful, as is module
    > initialization. We don't try to do it yet (and indeed kprobes blocks
    > it all).
    >
    > In any case, the method of probe address calculation doesn't affect
    > that issue. We can calculate .init* addresses relative to any
    > convenient reference in exactly the same way as non-.init addresses.
    >
    >
    >>>> Assuming you meant _text (which is dangerous because it's a define
    >>>> in the kernel linker script and could change).
    >>> By "dangerous" do you only mean that it may require a one-liner
    >>> catch-up patch in systemtap if the kernel linker scripts change?

    >> Dangerous as in it's not necessarily part of the kernel linker scripts.
    >> [...]
    >> The point, really, is to remove some of the fragile dependencies between
    >> systemtap and the kernel.

    >
    > Yes, that is generally desirable - each case is usually a question of
    > cost/benefit. One significant requirement for us is to keep working
    > with older kernels.


    Hi Frank,

    I know we'd better archive that requirement. However, if we lose support
    from developers because we are too much focusing on that, we'll also
    lose the future of systemtap itself. We have to see the cost/benefit
    from the long-term of view.

    Could we separate systemtap parser/elaborator and code generator
    to support both of old and new kernels?

    Thank you,

    --
    Masami Hiramatsu

    Software Engineer
    Hitachi Computer Products (America) Inc.
    Software Solutions Division

    e-mail: mhiramat@redhat.com

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Thu, 2008-07-17 at 17:36 -0400, Masami Hiramatsu wrote:
    > James Bottomley wrote:
    > > On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
    > >> OK, thought about it. There seem to be two possible solutions
    > >>
    > >> 1. Get systemtap always to offset from non-static functions. This
    > >> will use the standard linker to ensure uniqueness (a module
    > >> qualifier will still need to be added to the struct kprobe for
    > >> lookup, since modules can duplicate unexported kernel symbols).
    > >> 2. Add the filename as a discriminator for duplicate symbols in the
    > >> kallsyms program (would still need module qualifier). This is
    > >> appealing because the path name would be printed in the kernel
    > >> trace to help with oops tracking
    > >>
    > >> This is where negotiations come in. To me 2. looks to be better because
    > >> it will help us with oops tracking. On the other hand, it's usually
    > >> pretty obvious from the stack trace context which files the duplicate
    > >> symbols are actually in; what do other people think?

    > >
    > > Just by way of illustration, this is systemtap fixed up according to
    > > suggestion number 1. You can see now using your test case that we get:
    > >
    > > # probes
    > > kernel.function("do_open@fs/block_dev.c:929") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@fs/nfsctl.c:24") /* pc= */ /* <- kernel.function("do_open") */
    > > kernel.function("do_open@ipc/mqueue.c:642") /* pc= */ /* <- kernel.function("do_open") */

    >
    > Hi James,
    >
    > Thank you for updating the patch.
    > Unfortunately, I found another scenario; if someone make a module which
    > has EXPORT_SYMBOL(do_open), it's a non-static function. but there are
    > other static version do_open in kallsyms.
    > Here, I tested it and got below;
    >
    > $ stap --kelf -e 'probe module("test").function("do_open"){}' -p2
    > # probes
    > module("test").function("do_open@?") /* pc= */ /* <- module("test").function("do_open") */
    >
    > And I think similar issue will occur even if it is embedded in vmlinux.


    Actually, no. This is only a module problem ... it's triggered by the
    fact that the module namespace is different from the kernel's global
    namespace. To get around this, I think the actual module (or null for
    kernel) has to become an extra parameter to struct kprobe.

    > By the way, can this patch solve the issue of -ffunction-sections?


    Actually not entirely, no, if we go for only global symbols. The
    compiler is entitled to spit out a section even for a static function as
    long as it has a real body. If the module loader insterts stubs then
    even an offset from a nearby function could end up being wrong

    > Anyway, I think we still need solution no.2.


    I'll cook up a patch and run it by lkml to try to gauge the reaction.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  16. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    James Bottomley writes:

    > One of the big nasties of systemtap is the way it tries to embed
    > virtually the entirety of the kernel symbol table in the probe modules
    > it constructs. This is highly undesirable because it represents a
    > subversion of the kernel API to gain access to unexported symbols. At
    > least for kprobes, the correct way to do this is to specify the probe
    > point by symbol and offset.
    >
    > This patch converts systemtap to use the correct kprobe
    > symbol_name/offset pair to identify the probe location.
    >
    > This only represents a baby step: after this is done, there are at
    > least three other consumers of the systemtap module relocation
    > machinery:
    >
    > 1. unwind information. I think the consumers of this can be
    > converted to use the arch specific unwinders that already exist
    > within the kernel



    Right now x86 doesn't really have a good reliable unwinder that
    works without frame pointer. I think systemtap
    recently switched to Jan Beulich's dwarf2 unwinder. Before
    switching to the in kernel unwinder that one would need to be
    re-merged again.

    -Andi

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  17. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    On Fri, 2008-07-18 at 11:11 +0200, Andi Kleen wrote:
    > James Bottomley writes:
    >
    > > One of the big nasties of systemtap is the way it tries to embed
    > > virtually the entirety of the kernel symbol table in the probe modules
    > > it constructs. This is highly undesirable because it represents a
    > > subversion of the kernel API to gain access to unexported symbols. At
    > > least for kprobes, the correct way to do this is to specify the probe
    > > point by symbol and offset.
    > >
    > > This patch converts systemtap to use the correct kprobe
    > > symbol_name/offset pair to identify the probe location.
    > >
    > > This only represents a baby step: after this is done, there are at
    > > least three other consumers of the systemtap module relocation
    > > machinery:
    > >
    > > 1. unwind information. I think the consumers of this can be
    > > converted to use the arch specific unwinders that already exist
    > > within the kernel

    >
    >
    > Right now x86 doesn't really have a good reliable unwinder that
    > works without frame pointer. I think systemtap
    > recently switched to Jan Beulich's dwarf2 unwinder. Before
    > switching to the in kernel unwinder that one would need to be
    > re-merged again.


    Those are two separate issues.

    1) stap ought to use the kernel's infrastructure and not re-implement
    its own.

    2) if the kernel's infrastructure doesn't meet requirements, improve
    it.

    But while the x86 might not be perfect, its fairly ok these days. Its
    not the utter piece of ****e x86_64 had for a long time - today's traces
    mostly make sense.



    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  18. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)


    > 1) stap ought to use the kernel's infrastructure and not re-implement
    > its own.
    >
    > 2) if the kernel's infrastructure doesn't meet requirements, improve
    > it.


    No argument on either of those. Right now the kernel infrastructure
    is only comparable to what systemtap overs at very high overhead
    costs (see below)

    > But while the x86 might not be perfect, its fairly ok these days. Its
    > not the utter piece of ****e x86_64 had for a long time


    Not sure what you're referring to with this. AFAIK the x86-64 unwinder
    for a normal frame pointer less kernel was not any worse (or better)
    than a i386 kernel without frame pointers.

    - today's traces
    > mostly make sense.


    If you enable frame pointers? Making your complete kernel slower?
    Generating much worse code on i386 by wasting >20% of its available
    registers? Getting pipeline stalls on each function call/exit on many CPUs?

    Right now unfortunately there are a few rogue CONFIGs who select that
    so more and more kernels have, but I found that always distateful because
    enabling frame pointers has such a large impact on all kernel code, especially
    on the register starved i386.

    I still think the right solution eventually is to have a dwarf2 unwinder
    by default for i386/x86-64 and get rid of all these nasty "select
    FRAME_POINTER"s which have unfortunately sneaked in.

    -Andi

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    Peter Zijlstra wrote:
    > On Fri, 2008-07-18 at 12:31 +0200, Andi Kleen wrote:
    >
    >>> But while the x86 might not be perfect, its fairly ok these days. Its
    >>> not the utter piece of ****e x86_64 had for a long time

    >> Not sure what you're referring to with this. AFAIK the x86-64 unwinder
    >> for a normal frame pointer less kernel was not any worse (or better)
    >> than a i386 kernel without frame pointers.

    >
    > I hardly ever compile a kernel without frame pointers, as debugability
    > is top priority for me, so I'm afraid I'm not sure how bad it gets
    > without.
    >
    > But it used to be that x86_64 was crap even with frame pointers, and
    > without it it was just random gibberish - Arjan fixed that up somewhere
    > around .25 iirc.


    Well yes it was designed for dwarf2 unwinding and Linus' revert of that
    wrecked it. Also even the frame pointer walker was in the dwarf2
    code, so with that removed it fell back to a somewhat dumb unwinder.

    AFAIK most of the assembler code still doesn't set up frame pointers,
    so you'll likely still run into problems.

    >
    >> - today's traces
    >>> mostly make sense.

    >> If you enable frame pointers? Making your complete kernel slower?
    >> Generating much worse code on i386 by wasting >20% of its available
    >> registers? Getting pipeline stalls on each function call/exit on many CPUs?
    >>
    >> Right now unfortunately there are a few rogue CONFIGs who select that
    >> so more and more kernels have, but I found that always distateful because
    >> enabling frame pointers has such a large impact on all kernel code, especially
    >> on the register starved i386.
    >>
    >> I still think the right solution eventually is to have a dwarf2 unwinder
    >> by default for i386/x86-64 and get rid of all these nasty "select
    >> FRAME_POINTER"s which have unfortunately sneaked in.

    >
    > No argument on i386 vs frame pointers, fortunately I hardly ever build a
    > 32bit kernel these days, 32bit hardware is disappearing fast here :-)


    It hurts even on 64bit, e.g. the pipe line stall issue is there which
    can be a significant part of a function call cost of a small function
    of which the kernel has plenty (that was the major reason why the x86-64
    ABI discouraged frame pointers BTW and encouraged unwinding). Given several
    modern cores have special hardware to avoid that, but there's still a lot of the
    older cores around and even on the modern cores it's not free.

    Also it's ~10% of the registers there. And it has some other impact.

    > As to merging the dwarf unwinder, I have no particular objection to
    > getting that merged as I'm rather ignorant on these matters - but from
    > reading emails around the last merge attempt, Linus had some strong
    > opinions on the matter.
    >
    > Have those been resolved since?


    I added some checks back then to satisfy Linus. Also the dwarf2 unwinder
    is shipping successfully for some time both in systemtap and in opensuse 11.0,
    [although that one unfortunately fell victim to one of the rogue select
    FRAME_POINTERs, sigh!] so there shouldn't be much teething problems left.

    -Andi

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  20. Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)

    Peter Zijlstra writes:

    > [...]
    >> Right now x86 doesn't really have a good reliable unwinder that
    >> works without frame pointer. I think systemtap
    >> recently switched to Jan Beulich's dwarf2 unwinder. Before
    >> switching to the in kernel unwinder that one would need to be
    >> re-merged again.

    >
    > Those are two separate issues.
    >
    > 1) stap ought to use the kernel's infrastructure and not re-implement
    > its own.
    > 2) if the kernel's infrastructure doesn't meet requirements, improve
    > it.


    They are related to the extent that readers may not realize some
    implications of systemtap being/becoming a *kernel-resident* but not
    *kernel-focused* tool.

    For example, we're about to do unwinding/stack-traces of userspace
    programs. To what extent do you think the kernel unwinder (should one
    reappear in git) would welcome patches that provide zero benefit to
    the kernel, but only enable a peculiar (nonintrusive) sort of
    unwinding we would need for complex userspace stacks?


    - FChE
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 1 of 2 1 2 LastLast